<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1167"> <Title>Statistical Language Modeling with Performance Benchmarks using Various Levels of Syntactic-Semantic Information</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Syntactically Enhanced LSA </SectionTitle> <Paragraph position="0"> Latent semantic analysis (LSA) is a statistical, algebraic technique for extracting and inferring relations of expected contextual usage of words in documents (Landauer et al., 1998). It is based on word-document co-occurrence statistics, and is thus often called a 'bag-of-words' approach. It neglects the word-order, or syntactic, information in a language which, if properly incorporated, can lead to better language modeling. In an effort to include syntactic information in the LSA framework, we have developed a model which characterizes a word's behavior across various syntactic and semantic contexts. This is achieved by augmenting a word with its syntactic descriptor and considering the pair as the unit of knowledge representation. The resultant LSA-like analysis is termed syntactically enhanced latent semantic analysis (SELSA). This approach models a word's usage at a finer resolution than the average representation given by LSA. This finer resolution can be used to better discriminate semantically ambiguous sentences for cognitive modeling, as well as to predict a word from its syntactic-semantic history for language modeling. We explain below a step-by-step procedure for building this model.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Word-Tag-Document Structure </SectionTitle> <Paragraph position="0"> The syntactic description of a word can take many forms, such as a part-of-speech tag, phrase type or supertag. In the description hereafter we refer to any such syntactic information as the tag of the word. Now, consider a tagged training corpus of sufficient size in the domain of interest. The first step is to construct a matrix whose rows correspond to word-tag pairs and whose columns correspond to documents in the corpus. A document can be a sentence, a paragraph or a larger unit of text. If the vocabulary V consists of I words, the tagset T consists of J tags and the number of documents in the corpus is K, then the matrix is IJ x K. Let c_{i_j,k} denote the frequency of word w_i with tag t_j in document d_k. The subscript notation i_j is used for convenience and indicates word w_i with tag t_j, i.e., the (i-1)J + j-th row of the matrix. We then find the normalized entropy e_{i_j} of each word-tag pair and scale the corresponding row of the matrix by (1 - e_{i_j}). Document-length normalization is also applied to each column of the matrix by dividing the entries of the k-th document by n_k, the number of words in document d_k. Let c_{i_j} be the frequency of the i_j-th word-tag pair in the whole corpus, i.e., c_{i_j} = \sum_{k=1}^{K} c_{i_j,k}. Then the matrix entries are given by X_{i_j,k} = (1 - e_{i_j}) \frac{c_{i_j,k}}{n_k}, where e_{i_j} = -\frac{1}{\log K} \sum_{k=1}^{K} \frac{c_{i_j,k}}{c_{i_j}} \log \frac{c_{i_j,k}}{c_{i_j}}.</Paragraph> <Paragraph position="2"> Once the matrix X is obtained, we perform its singular value decomposition (SVD) and approximate it by keeping the largest R singular values and setting the rest to zero.</Paragraph> <Paragraph position="4"> Thus, X \approx \hat{X} = U S V^T, where U (IJ x R) and V (K x R) are orthonormal matrices and S (R x R) is a diagonal matrix.
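As a concrete illustration of the matrix construction and SVD step described above, here is a minimal Python sketch. It assumes the corpus is already available as lists of (word, tag) pairs per document; the function names are illustrative, the entropy weighting follows the standard normalized-entropy convention used in LSA work, and this is not the authors' implementation.

```python
import numpy as np

def build_selsa_matrix(tagged_docs, vocab, tagset):
    """tagged_docs: list of documents, each a list of (word, tag) pairs.
    Returns the IJ x K entropy-weighted, length-normalized matrix X.
    Assumes a corpus with more than one document (K > 1)."""
    I, J, K = len(vocab), len(tagset), len(tagged_docs)
    w_idx = {w: i for i, w in enumerate(vocab)}
    t_idx = {t: j for j, t in enumerate(tagset)}
    counts = np.zeros((I * J, K))                        # c_{i_j,k}
    for k, doc in enumerate(tagged_docs):
        for w, t in doc:
            counts[w_idx[w] * J + t_idx[t], k] += 1.0
    totals = counts.sum(axis=1, keepdims=True)           # c_{i_j}
    doc_len = np.maximum(counts.sum(axis=0, keepdims=True), 1.0)  # n_k
    # normalized entropy e_{i_j} of each word-tag pair across documents
    p = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    logp = np.log(np.where(p > 0, p, 1.0))               # log 1 = 0 where p == 0
    ent = -np.sum(p * logp, axis=1) / np.log(K)
    return (1.0 - ent)[:, None] * counts / doc_len

def truncated_svd(X, R):
    """Rank-R approximation X ~ U S V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :R], np.diag(s[:R]), Vt[:R].T            # U: IJ x R, S: R x R, V: K x R
```

For a corpus of realistic size one would use a sparse matrix and a truncated SVD routine (e.g., scipy.sparse.linalg.svds) instead; the dense version above is only meant to mirror the equations.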
It is this dimensionality reduction step through SVD that captures the major structural associations between word-tag pairs and documents, removes 'noisy' observations, and allows word-tag pairs and documents to be represented in the same R-dimensional space (albeit in different bases). This common representation is used (eq. 12) to find the syntactic-semantic correlation between the present word and the history of words, and then to derive the language model probabilities. This R-dimensional space can be called either the syntactically enhanced latent semantic space or the latent syntactic-semantic space.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Document Projection in SELSA Space </SectionTitle> <Paragraph position="0"> After the knowledge is represented in the latent syntactic-semantic space, we can project any new document as an R-dimensional vector \bar{v}_{selsa} in this space. Let the new document consist of a word sequence w_{i_1}, w_{i_2}, ..., w_{i_n} and let the corresponding tag sequence be t_{j_1}, t_{j_2}, ..., t_{j_n}, where i_p and j_p are the indices of the p-th word and its tag in the vocabulary V and the tagset T respectively. Let d be the IJ x 1 vector representing this document, whose elements d_{i_j} are the frequency counts, i.e., the number of times word w_i occurs with tag t_j, weighted by the corresponding entropy measure (1 - e_{i_j}). It can be thought of as an additional column of the matrix X, and therefore as having a corresponding vector v in the matrix V.</Paragraph> <Paragraph position="1"> Then, d = U S v^T and \bar{v}_{selsa} = v = d^T U S^{-1}, which is a 1 x R dimensional vector representation of the document in the latent space. Here u_{i_p j_p} denotes the row vector of the SELSA U matrix corresponding to the word w_{i_p} and tag t_{j_p} in the current document.</Paragraph> <Paragraph position="4"> We can also define a syntactic-semantic similarity measure between any two text documents as the cosine of the angle between their projection vectors in the latent syntactic-semantic space. With this measure we can address the problems that LSA has been applied to, namely natural language understanding, cognitive modeling, statistical language modeling, etc.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Statistical Language Modeling using SELSA </SectionTitle> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Framework </SectionTitle> <Paragraph position="0"> We follow the framework of (Bangalore, 1996) to define a class-based language model where the classes are defined by the tags. Here the probability of a sequence W_n of n words is given by P(W_n) = \sum_{T_n} \prod_{i=1}^{n} P(w_i | t_i, W_{i-1}, T_{i-1}) P(t_i | W_{i-1}, T_{i-1}), where t_i is a tag variable for the word w_i and the sum runs over all tag sequences T_n. To compute this probability in real time based on local information, we make certain assumptions: P(w_i | t_i, W_{i-1}, T_{i-1}) \approx P(w_i | t_i, w_{i-2}, w_{i-1}) and P(t_i | W_{i-1}, T_{i-1}) \approx P(t_i | t_{i-1}), where the probability of a word is calculated by renormalizing the tri-gram probability over those words which are compatible with the tag in context, and the tag probability is modeled using a bi-gram model. Other models, such as the tag-based likelihood probability of a word or tag tri-grams, can also be used.
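The tag-renormalized tri-gram word probability and the tag bi-gram just described can be sketched as follows. The probability tables (trigram, tag_bigram) and the words_for_tag compatibility map are assumed inputs for illustration, not structures defined in the paper.

```python
def word_prob_given_tag(w, prev1, prev2, tag, trigram, words_for_tag):
    """P(w | prev2, prev1, tag): renormalize the tri-gram P(w | prev2, prev1)
    over the words compatible with `tag`; incompatible words get probability 0.
    `trigram` maps (prev2, prev1, w) -> probability; `words_for_tag[tag]` is
    the set of words that can anchor `tag`."""
    if w not in words_for_tag[tag]:
        return 0.0
    num = trigram.get((prev2, prev1, w), 0.0)
    den = sum(trigram.get((prev2, prev1, v), 0.0) for v in words_for_tag[tag])
    return num / den if den > 0 else 0.0

def tag_prob(tag, prev_tag, tag_bigram):
    """P(tag | prev_tag) from a tag bi-gram table mapping (prev_tag, tag) -> probability."""
    return tag_bigram.get((prev_tag, tag), 0.0)
```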
Similarly, there is a motivation for using the syntactically enhanced latent semantic analysis method to derive the word probability given the syntax of the tag and the semantics of the word history.</Paragraph> <Paragraph position="5"> The calculation of perplexity is based on the conditional probability of a word given the word history, which can be derived using a recursive computation.</Paragraph> <Paragraph position="7"> A further reduction in computation is achieved by restricting the summation to only those tags which the target word can anchor. A similar expression using the tag tri-gram model can be derived, which includes a double summation. The efficiency of this model depends upon the prediction of the tag t_q using the word history W_{q-1}. When the target tag is correctly known, we can derive a performance benchmark in terms of a lower bound on the achievable perplexity. Furthermore, if we assume a tagged corpus, then the t_q's and T_q's become deterministic variables and (5) and (7) reduce to their tagged-history forms, in which case the SELSA language model described next can be easily applied to calculate the benchmarks.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 SELSA Language Model </SectionTitle> <Paragraph position="0"> A SELSA model using tag information for each word can also be developed and used along the lines of the LSA-based language model. We can observe in the above framework the need for a probability of the form P(w_q | t_q, W_{q-1}, T_{q-1}), which can be evaluated using the SELSA representation of the word-tag pair corresponding to w_q and t_q and of the history W_{q-1} T_{q-1}. The former is given by the row u_{i_q j_q} of the SELSA U matrix, and the latter can be projected onto the SELSA space as a vector \tilde{\bar{v}}_{q-1} using (4). The length of the history can be tapered to reduce the effect of far-distant words using an exponential forgetting factor 0 < \lambda < 1 as below: \tilde{\bar{v}}_{q-1} = \sum_{p=1}^{q-1} \lambda^{q-1-p} (1 - e_{i_p j_p}) u_{i_p j_p} S^{-1}.</Paragraph> <Paragraph position="2"> The next step is to calculate the cosine measure reflecting the syntactic-semantic 'closeness' between the word w_q and the history W_{q-1}: K(w_q, \tilde{\bar{v}}_{q-1}) = \frac{u_{i_q j_q} S \tilde{\bar{v}}_{q-1}^T}{\| u_{i_q j_q} S^{1/2} \| \, \| \tilde{\bar{v}}_{q-1} S^{1/2} \|}.</Paragraph> <Paragraph position="4"> The SELSA probability is then obtained by distributing the total probability mass in proportion to this closeness measure, such that the least likely word has a probability of 0 and all probabilities sum to one: P_{selsa}(w_q | W_{q-1}, T_q) = \frac{K(w_q, \tilde{\bar{v}}_{q-1}) - K_{min}}{\sum_{w_i \in V} (K(w_i, \tilde{\bar{v}}_{q-1}) - K_{min})}, where K_{min} is the minimum closeness over the vocabulary.</Paragraph> <Paragraph position="6"> But this results in a very limited dynamic range for the SELSA probabilities, which leads to poor performance. This is alleviated by raising the above probability to a power \gamma > 1 and then renormalizing as follows (Coccaro and Jurafsky, 1998): P^{\gamma}_{selsa}(w_q | W_{q-1}, T_q) = \frac{P_{selsa}(w_q | W_{q-1}, T_q)^{\gamma}}{\sum_{w_i \in V} P_{selsa}(w_i | W_{q-1}, T_q)^{\gamma}}.</Paragraph> <Paragraph position="8"> This probability gives more importance to large-span syntactic-semantic dependencies and is thus higher for words which are syntactic-semantically regular with respect to the recent history than for others. But it will not predict very well certain locally regular words like of, the, etc., whose main role is to support the syntactic structure of a sentence. On the other hand, n-gram language models are able to model such words well because of maximum likelihood estimation from the training corpus and various smoothing techniques.
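A rough sketch of the SELSA probability computation just described, assuming the U and S matrices and the entropy weights from Section 2 are available. The tapering and cosine expressions follow the standard LSA language-model treatment, and the default values of lam and gamma are purely illustrative, not values reported in the paper.

```python
import numpy as np

def project_history(history, U, S, ent, lam=0.95):
    """Tapered projection of a word-tag history into SELSA space.
    `history` holds row indices of the (word, tag) pairs, oldest first;
    older positions are down-weighted by the forgetting factor lam in (0, 1)."""
    S_inv = np.linalg.inv(S)
    v = np.zeros(S.shape[0])
    n = len(history)
    for p, row in enumerate(history):
        v += (lam ** (n - 1 - p)) * (1.0 - ent[row]) * (U[row] @ S_inv)
    return v

def selsa_probs(history_vec, U, S, gamma=8.0):
    """Cosine closeness of every word-tag row to the projected history,
    shifted so the least likely entry gets zero mass, normalized, and then
    sharpened with the exponent gamma (> 1) before renormalizing."""
    s_half = np.sqrt(np.diag(S))
    rows = U * s_half                      # rows u_{i_j} S^{1/2}
    hist = history_vec * s_half            # v S^{1/2}
    sims = rows @ hist / (np.linalg.norm(rows, axis=1) * np.linalg.norm(hist) + 1e-12)
    shifted = sims - sims.min()
    probs = shifted / (shifted.sum() + 1e-12)
    sharpened = probs ** gamma
    return sharpened / sharpened.sum()
```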
So the best performance can be achieved by integrating the two.</Paragraph> <Paragraph position="9"> One way to derive the 'SELSA + N-gram' joint probability P_{sel+ng}(w_q | W_{q-1}) is to use the geometric mean based integration formula given for LSA in (Coccaro and Jurafsky, 1998): P_{sel+ng}(w_q | W_{q-1}) = \frac{P_{selsa}(w_q | W_{q-1}, T_q)^{\xi_{i_q}} \, P_{ngram}(w_q | w_{q-1}, w_{q-2})^{1-\xi_{i_q}}}{\sum_{w_i \in V} P_{selsa}(w_i | W_{q-1}, T_q)^{\xi_i} \, P_{ngram}(w_i | w_{q-1}, w_{q-2})^{1-\xi_i}}, where \xi_{i_q} = (1 - e_{i_q j_q})/2 and \xi_i = (1 - e_{i j_q})/2 are the geometric mean weights for the SELSA probabilities of the current word w_q and of any word w_i \in V respectively.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Various Levels of Syntactic Information </SectionTitle> <Paragraph position="0"> In this section we explain the various levels of syntactic information that can be incorporated within the SELSA framework: supertags, phrase type, and content/function word type. These are in decreasing order of complexity and provide finer to coarser levels of syntactic information.</Paragraph> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Supertags </SectionTitle> <Paragraph position="0"> Supertags are the elementary structures of Lexicalized Tree Adjoining Grammars (LTAGs) (Bangalore and Joshi, 1999). They are combined by the operations of substitution and adjunction to yield a parse for the sentence. Each supertag is lexicalized, i.e., associated with at least one lexical item - the anchor. Further, all the arguments of the anchor of a supertag are localized within the same supertag, which allows the anchor to impose syntactic and semantic (predicate-argument) constraints directly on its arguments. As a result, a word is typically associated with one supertag for each syntactic configuration the word may appear in. Supertags can be seen as providing a much more refined set of classes than part-of-speech tags do, and hence we expect supertag-based language models to be better than part-of-speech based language models.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Phrase-type </SectionTitle> <Paragraph position="0"> Words in a sentence are not just strung together as a sequence of parts of speech; rather, they are organized into phrases, groupings of words that act as a unit. A sentence normally rewrites as a subject noun phrase (NP) and a verb phrase (VP), which are the major phrase types, apart from prepositional phrases, adjective phrases, etc. (Manning and Schutze, 1999). Using the two major phrase types, with the rest considered as an 'other' type, we constructed a model for SELSA. This model assigns each word three syntactic descriptions depending on its frequency of occurrence in each of the three phrase types across a number of documents, and thus captures the semantic behaviour of each word in each phrase type. Generally, nouns occur in noun phrases and verbs occur in verb phrases, while prepositions occur in the other type. So this framework brings a finer syntactic resolution into each word's semantic description as compared to the LSA-based average description.
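As a small illustration of the three-way phrase-type description discussed above, the sketch below collapses chunk labels into the NP/VP/other tagset and accumulates per-document word-phrase-type counts. It assumes the corpus has already been chunked so that each word carries a phrase label; the label prefixes ('NP', 'VP') are assumptions about the chunker's output, not part of the paper.

```python
def phrase_type_tag(chunk_label):
    """Collapse a phrase/chunk label into the three SELSA phrase types."""
    if chunk_label.startswith("NP"):
        return "NP"
    if chunk_label.startswith("VP"):
        return "VP"
    return "OTHER"

def phrase_type_counts(chunked_docs):
    """chunked_docs: list of documents, each a list of (word, chunk_label) pairs.
    Returns per-document counts keyed by (word, phrase_type), i.e. the rows of
    the word-tag/document matrix when the tagset is {NP, VP, OTHER}."""
    per_doc = []
    for doc in chunked_docs:
        counts = {}
        for word, label in doc:
            key = (word.lower(), phrase_type_tag(label))
            counts[key] = counts.get(key, 0) + 1
        per_doc.append(counts)
    return per_doc
```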
This is particularly important for certain words occurring as both a noun and a verb.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Content or Function Word Type </SectionTitle> <Paragraph position="0"> If a text corpus is analyzed by counting word frequencies, it is observed that certain words occur with very high frequencies, e.g., the, and, a, to, etc. These words have very important grammatical behaviour, but they do not convey much of the semantics; they are called function or stop words. Similarly, a text corpus contains words with frequencies in the moderate to low range, e.g., car, wheel, road, etc. These each play an important role in deciding the semantics associated with the whole sentence or document, and are thus known as content words. Generally, a vocabulary consists of a few hundred function words and a few tens of thousands of content words; however, the two groups cover more or less the same frequency space of a corpus. So it is essential to give them equal importance by treating them separately in a language modeling framework, as they convey orthogonal kinds of information: syntactic vs. semantic. LSA is better at predicting topic-bearing content words, while parsing-based models are better for function words. N-gram models are also quite good at modeling function words, but they lack the large-span semantics that can be captured by LSA. On the other hand, the SELSA model is suitable for both types of words, as it captures the semantics of a word in a syntactic context. We performed experiments with LSA and with SELSA at various levels of syntactic information in both situations: content words only, and content and function words together. In the former case, the function words are treated by the n-gram model only.</Paragraph> </Section> </Section> </Paper>