Improving Probabilistic Latent Semantic Analysis with Principal Component Analysis

3 LSA

We briefly review the LSA model, as presented in Deerwester et al. (1990), and then outline the LSA-based probability model presented in Ding (1999).

The term-to-document association is represented as an $m \times n$ term-document matrix

$$X = [\,d_1 \; d_2 \; \cdots \; d_n\,] \qquad (1)$$

containing the frequencies of the $m$ index terms occurring in $n$ documents. The frequency counts can also be weighted to reflect the relative importance of individual terms (e.g., Guo et al. (2003)). Here $d_i$ is an $m$-dimensional column vector representing document $i$, and $t_j$ is an $n$-dimensional row vector representing term $j$. LSA represents terms and documents in a new vector space of smaller dimension that minimizes the distance between the projected terms and the original terms. This is done through the truncated (to rank $k$) singular value decomposition

$$X \approx X_k = U_k \Sigma_k V_k^T = \sum_{l=1}^{k} \sigma_l\, u_l v_l^T. \qquad (2)$$

Among all $m \times n$ matrices of rank $k$, $X_k$ is the one that minimizes the Frobenius norm $\|X - X_k\|_F$.

3.1 LSA-based Probability Model

The LSA model based on SVD is a dimensionality reduction algorithm and as such does not have a probabilistic interpretation. However, under certain assumptions on the distribution of the input data, the SVD can be used to define a probability model. In this section, we summarize the results presented in Ding (1999) of a dual probability representation of LSA.

Assuming that the probability distribution of a document $d_i$ is governed by $k$ characteristic (normalized) document vectors $u_1, \ldots, u_k$, and that $u_1, \ldots, u_k$ are statistically independent factors, Ding (1999) shows that under maximum likelihood estimation the optimal solution for $P(d_i \mid u_1, \ldots, u_k)$ is

$$P(d_i \mid u_1, \ldots, u_k) = \frac{e^{\sum_{l=1}^{k} (u_l^T d_i)^2}}{Z(u_1, \ldots, u_k)}, \qquad (3)$$

where $Z(u_1, \ldots, u_k)$ is a normalization constant.

The dual formulation for the probability of term $t_j$ in terms of the right eigenvectors $v_1, \ldots, v_k$ (i.e., the document representations) of the matrix $X_k$ is

$$P(t_j \mid v_1, \ldots, v_k) = \frac{e^{\sum_{l=1}^{k} (t_j v_l)^2}}{Z(v_1, \ldots, v_k)}, \qquad (4)$$

where $Z(v_1, \ldots, v_k)$ is a normalization constant.

Ding also shows that $u_l$ is related to $v_l$ by

$$u_l = \frac{1}{\sigma_l}\, X v_l. \qquad (5)$$

We will use Equations 3-5 in relating LSA to PLSA in Section 5.
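To make Equations 2-4 concrete, the following is a minimal NumPy sketch; it is our own illustration, not code from the paper. The function names are ours, and approximating the normalization constants $Z(\cdot)$ by normalizing over the observed documents (or terms) is an assumption of this sketch.

```python
import numpy as np

def truncated_svd(X, k):
    # Rank-k truncated SVD of the m x n term-document matrix X (Equation 2).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def ding_document_probs(X, U_k):
    # Equation 3: P(d_i | u_1..u_k) proportional to exp(sum_l (u_l^T d_i)^2);
    # Z(u_1..u_k) is approximated by normalizing over the observed documents.
    scores = ((U_k.T @ X) ** 2).sum(axis=0)   # scores[i] = sum_l (u_l^T d_i)^2
    scores -= scores.max()                    # guard against overflow in exp
    p = np.exp(scores)
    return p / p.sum()

def ding_term_probs(X, Vt_k):
    # Equation 4 (dual form): P(t_j | v_1..v_k) proportional to exp(sum_l (t_j v_l)^2),
    # where t_j is the j-th row of X; normalized over the observed terms.
    scores = ((X @ Vt_k.T) ** 2).sum(axis=1)  # scores[j] = sum_l (t_j v_l)^2
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()
```

A call such as `U_k, s_k, Vt_k = truncated_svd(X, 64)` followed by `ding_document_probs(X, U_k)` yields the document distribution of Equation 3 for a 64-factor model.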
4 PLSA

The PLSA model (Hofmann, 1999) is a generative statistical latent class model: (1) select a document $d$ with probability $P(d)$; (2) pick a latent class $z$ with probability $P(z \mid d)$; (3) generate a word $w$ with probability $P(w \mid z)$.

The joint probability between a word and a document is

$$P(d, w) = P(d)\, P(w \mid d), \qquad (6)$$

$$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d), \qquad (7)$$

and using Bayes' rule can be written as

$$P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z). \qquad (8)$$

Hofmann (1999) uses the EM algorithm to compute optimal parameters. The E-step is given by

$$P(z \mid d, w) = \frac{P(z)\, P(d \mid z)\, P(w \mid z)}{\sum_{z'} P(z')\, P(d \mid z')\, P(w \mid z')}, \qquad (9)$$

and the M-step, in which $n(d, w)$ denotes the number of occurrences of word $w$ in document $d$, is given by

$$P(w \mid z) = \frac{\sum_{d} n(d, w)\, P(z \mid d, w)}{\sum_{d, w'} n(d, w')\, P(z \mid d, w')}, \qquad (10)$$

$$P(d \mid z) = \frac{\sum_{w} n(d, w)\, P(z \mid d, w)}{\sum_{d', w} n(d', w)\, P(z \mid d', w)}, \qquad (11)$$

$$P(z) = \frac{\sum_{d, w} n(d, w)\, P(z \mid d, w)}{\sum_{d, w} n(d, w)}. \qquad (12)$$

4.1 Model Initialization and Performance

An important consideration in PLSA modeling is that the performance of the model is strongly affected by the initialization of the model prior to training. Thus a method is needed for identifying a good initialization or, alternatively, a good trained model. If the final likelihood value obtained after training were well correlated with accuracy, one could train several PLSA models, each with a different initialization, and select the model with the largest likelihood as the best model. Although, for a given initialization, the likelihood increases to a locally optimal value with each iteration of EM, the final likelihoods obtained from different initializations after training do not correlate well with the accuracy of the corresponding models. This is shown in Table 1, which presents correlation coefficients between likelihood values and either average or break-even precision for several datasets with 64 or 256 latent classes, i.e., factors. Twenty random initializations were used per evaluation, with fifty iterations of EM per initialization, which empirically is more than enough to approach the optimal likelihood. The coefficients range from -0.64 to 0.25. This poor correlation indicates the need for a method to handle the variation in performance due to the influence of different initialization values, for example through better initialization methods.

Hofmann (1999) and Brants (2002) averaged results from five and four random initializations, respectively, and empirically found this to improve performance. Combining models lets the redundancy among them minimize the expression of individual model errors. We extend this approach by replacing one random initialization with one reasonably good initialization in the averaged models. We will show empirically that having at least one reasonably good initialization improves performance over simply using a number of different random initializations.
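For reference, here is a compact NumPy sketch of the EM procedure of Equations 9-12. It is our own illustration, not the authors' code: the dense `(m, n, k)` posterior array is chosen for clarity rather than memory efficiency, and the random initialization mirrors the multi-restart setting discussed above.

```python
import numpy as np

def plsa_em(N, k, iters=50, seed=None):
    """Fit PLSA by EM (Equations 9-12). N is the m x n matrix of counts n(d, w),
    with terms on the rows and documents on the columns."""
    rng = np.random.default_rng(seed)
    m, n = N.shape
    # Random initialization: P(w|z) is m x k, P(d|z) is n x k, P(z) has length k.
    Pw_z = rng.random((m, k)); Pw_z /= Pw_z.sum(axis=0)
    Pd_z = rng.random((n, k)); Pd_z /= Pd_z.sum(axis=0)
    Pz = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step (Equation 9): posterior P(z|d,w), stored densely as (m, n, k).
        joint = Pz[None, None, :] * Pw_z[:, None, :] * Pd_z[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step (Equations 10-12): re-estimate from count-weighted posteriors.
        weighted = N[:, :, None] * post
        Pw_z = weighted.sum(axis=1) + 1e-12; Pw_z /= Pw_z.sum(axis=0)
        Pd_z = weighted.sum(axis=0) + 1e-12; Pd_z /= Pd_z.sum(axis=0)
        Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum()
    return Pw_z, Pd_z, Pz
```

Running `plsa_em` from several seeds and averaging the resulting models corresponds to the multi-initialization averaging strategy of Hofmann (1999) and Brants (2002) described above.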
5 LSA-based Initialization of PLSA

The EM algorithm for estimating the parameters of the PLSA model is initialized with estimates of the model parameters $P(z)$, $P(w \mid z)$, and $P(d \mid z)$. Hofmann (1999) relates the parameters of the PLSA model to an LSA model as follows:

$$P = U_{\mathrm{PLSA}}\, \Sigma_{\mathrm{PLSA}}\, V_{\mathrm{PLSA}}^T, \qquad (13)$$

$$U_{\mathrm{PLSA}} = \left[\, P(d_i \mid z_k) \,\right]_{i,k}, \qquad (14)$$

$$V_{\mathrm{PLSA}} = \left[\, P(w_j \mid z_k) \,\right]_{j,k}. \qquad (15)$$

The columns of $U_{\mathrm{PLSA}}$ and $V_{\mathrm{PLSA}}$ correspond to the latent classes of the PLSA model, and the mixing proportions of the latent classes in PLSA, $\Sigma_{\mathrm{PLSA}} = \mathrm{diag}\!\left(P(z_k)\right)$, correspond to the singular values of the SVD in LSA.

Note that we cannot directly identify the matrix $U_k$ with $U_{\mathrm{PLSA}}$, nor $V_k$ with $V_{\mathrm{PLSA}}$, since $U_k$ and $V_k$ contain negative values and are not probability distributions. However, using Equations 3 and 4, we can attach a probabilistic interpretation to LSA and then relate $U_{\mathrm{PLSA}}$ and $V_{\mathrm{PLSA}}$ to the corresponding LSA matrices. We now outline this relation.

Equation 4 represents the probability of occurrence of term $t_j$ in the different documents, conditioned on the SVD right eigenvectors. The $(j, k)$-th element of the matrix in Equation 15 represents the probability of term $w_j$ conditioned on the latent class $z_k$. As in the analysis above, we assume that the latent classes of the LSA model correspond to the latent classes of the PLSA model. Making the simplifying assumption that the latent classes of the LSA model are conditionally independent given term $t_j$, we can express $P(t_j \mid v_1, \ldots, v_k)$ as

$$P(t_j \mid v_1, \ldots, v_k) = \frac{P(t_j) \prod_{l=1}^{k} P(v_l \mid t_j)}{P(v_1, \ldots, v_k)} = \frac{\prod_{l=1}^{k} P(t_j \mid v_l)}{P(t_j)^{k-1}} \cdot \frac{\prod_{l=1}^{k} P(v_l)}{P(v_1, \ldots, v_k)}. \qquad (16)$$

Comparing Equation 16 with Equation 4 factor by factor gives

$$P(t_j \mid v_l) \propto e^{(t_j v_l)^2}. \qquad (17)$$

Thus, up to a constant that depends on $P(t_j)$ and $Z(v_1, \ldots, v_k)$, we can relate each $P(t_j \mid v_l)$ to a corresponding $e^{(t_j v_l)^2}$. We make the simplifying assumption that $P(t_j)$ is constant across terms and normalize the exponential term to a probability:

$$P(t_j \mid v_l) = \frac{e^{(t_j v_l)^2}}{\sum_{j'=1}^{m} e^{(t_{j'} v_l)^2}}. \qquad (18)$$

Relating the term $w_j$ in the PLSA model to the distribution of the LSA term over documents, $t_j$, and relating the latent class $z_l$ in the PLSA model to the LSA right eigenvector $v_l$, we then estimate $P(w_j \mid z_l)$ as

$$\hat{P}(w_j \mid z_l) = \frac{e^{(t_j v_l)^2}}{\sum_{j'=1}^{m} e^{(t_{j'} v_l)^2}}. \qquad (19)$$

Similarly, relating the document $d_i$ in the PLSA model to the distribution of the LSA document over terms, $d_i$, and the latent class $z_l$ to the left eigenvector $u_l$, we estimate $P(d_i \mid z_l)$ as

$$\hat{P}(d_i \mid z_l) = \frac{e^{(u_l^T d_i)^2}}{\sum_{i'=1}^{n} e^{(u_l^T d_{i'})^2}}. \qquad (20)$$

The singular values $\sigma_l$ in Equation 2 are by definition positive. Relating these values to the mixing proportions $P(z_l)$, we generalize the relation using a function $f(\cdot)$, where $f(\cdot)$ is any non-negative function over the range of all $\sigma_l$, and normalize so that the estimated $P(z_l)$ is a probability:

$$\hat{P}(z_l) = \frac{f(\sigma_l)}{\sum_{l'=1}^{k} f(\sigma_{l'})}. \qquad (21)$$

We have experimented with different forms of $f(\cdot)$, including the identity function and the logarithmic function. For our experiments, we used $f(\sigma_l) = \ldots$

In our LSA-initialized PLSA model, we initialize the PLSA model parameters using Equations 19-21. The EM algorithm is then run beginning with the E-step, as outlined in Equations 9-12.
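The following NumPy sketch of this initialization (Equations 19-21) is, like the earlier snippets, our own illustration with our own naming; the identity function is passed as the default for $f(\cdot)$ purely as a placeholder, since the paper's actual choice of $f$ is elided in this excerpt.

```python
import numpy as np

def lsa_init(X, k, f=lambda s: s):
    """LSA-based PLSA initialization (Equations 19-21). X is the m x n
    term-document matrix; f maps singular values to non-negative weights
    (the identity default is only a placeholder choice for f)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T  # V_k holds v_1..v_k as columns
    # Equation 19: P(w_j|z_l) ~ exp((t_j v_l)^2), normalized over terms j.
    tw = (X @ V_k) ** 2                           # (m, k); entry (j, l) = (t_j v_l)^2
    Pw_z = np.exp(tw - tw.max(axis=0))            # per-class shift for stability;
    Pw_z /= Pw_z.sum(axis=0)                      # cancels in the normalization
    # Equation 20: P(d_i|z_l) ~ exp((u_l^T d_i)^2), normalized over documents i.
    td = ((U_k.T @ X) ** 2).T                     # (n, k); entry (i, l) = (u_l^T d_i)^2
    Pd_z = np.exp(td - td.max(axis=0))
    Pd_z /= Pd_z.sum(axis=0)
    # Equation 21: P(z_l) = f(sigma_l) / sum_l' f(sigma_l').
    Pz = np.asarray(f(s_k), dtype=float)
    Pz /= Pz.sum()
    return Pw_z, Pd_z, Pz
```

The returned estimates of $P(w \mid z)$, $P(d \mid z)$, and $P(z)$ would replace the random initialization in an EM trainer such as the `plsa_em` sketch above, with training then beginning at the E-step of Equation 9.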