<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1014"> <Title>Improving Probabilistic Latent Semantic Analysis with Principal Component Analysis</Title> <Section position="7" start_page="108" end_page="111" type="evalu"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"> In this section we evaluate the performance of LSA-initialized PLSA (LSA-PLSA). We compare the performance of LSA-PLSA to LSA only and PLSA only, and also compare its use in combination with other models. We give results for a smaller-scale information retrieval application and a text segmentation application, tasks where the reduced-dimensional representation has been used successfully to improve performance over simpler word-count models such as tf-idf.</Paragraph> <Section position="1" start_page="108" end_page="108" type="sub_section"> <SectionTitle> 6.1 System Description </SectionTitle> <Paragraph position="0"> To test our approach for PLSA initialization, we developed an LSA implementation based on the SVDLIBC package (http://tedlab.mit.edu/~dr/SVDLIBC/) for computing the singular values of sparse matrices. The PLSA implementation was based on an earlier implementation by Brants et al. (2002). For each of the corpora, we tokenized the documents and used the LinguistX morphological analyzer to stem the terms. We used entropy weights (Guo et al., 2003) to weight the terms in the document matrix.</Paragraph> </Section> <Section position="2" start_page="108" end_page="110" type="sub_section"> <SectionTitle> 6.2 Information Retrieval </SectionTitle> <Paragraph position="0"> We compared the performance of the LSA-PLSA model against randomly-initialized PLSA and against LSA for four different retrieval tasks. In these tasks, the retrieval is over a smaller corpus, on the order of a personal document collection.
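The paper cites entropy weighting (Guo et al., 2003) without spelling out the formula. As an illustrative sketch only, the standard log-entropy scheme commonly used with LSA (an assumption, not the authors' code) looks like this:

```python
import math

def log_entropy_weights(counts):
    """Entropy-weight a term-document count matrix (rows = terms).

    Standard log-entropy scheme (assumed here; the paper only cites
    Guo et al., 2003):
      local weight  l_ij = log(1 + tf_ij)
      global weight g_i  = 1 + sum_j p_ij log p_ij / log(n_docs),
    where p_ij = tf_ij / gf_i and gf_i is the global term frequency.
    A term spread evenly over all documents gets weight 0; a term
    concentrated in one document keeps its full local weight.
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        gf = sum(row)
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / gf
                entropy += p * math.log(p)
        g = 1.0 + entropy / math.log(n_docs)
        weighted.append([g * math.log(1 + tf) for tf in row])
    return weighted
```

The weighted matrix would then be fed to the sparse SVD (SVDLIBC in the authors' setup) to obtain the LSA representation.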
We used the following four standard document collections: (i) MED (1033 document abstracts from the National Library of Medicine), (ii) CRAN (1400 documents from the Cranfield Institute of Technology), (iii) CISI (1460 abstracts in library science from the Institute for Scientific Information) and (iv) CACM (3204 documents from the Association for Computing Machinery). For each of these document collections, we computed the LSA, PLSA, and LSA-PLSA representations of both the document collection and the queries for a range of latent classes, or factors.</Paragraph> <Paragraph position="1"> For each data set, we used the computed representations to estimate the similarity of each query to all the documents in the original collection. For the LSA model, we estimated the similarity using the cosine distance between the reduced-dimensional representations of the query and the candidate document. For the PLSA and LSA-PLSA models, we first computed the probability of each word occurring in the document, P(w|d), using Equation 7 and assuming that P(d) is uniform. This gives us a PLSA-smoothed term representation of each document. We then computed the Hellinger similarity (Basu et al., 1997) between the term distributions of the candidate document, P(w|d), and query, P(w|q). In all of the evaluations, the results for the PLSA model were averaged over four different runs to account for the dependence on the initial conditions.</Paragraph> <Paragraph position="2"> In addition to LSA-based initialization of the PLSA model, we also investigated initializing the PLSA model by first running the &quot;k-means&quot; algorithm to cluster the documents into K classes, where K is the number of latent classes, and then initializing P(w|z) based on the statistics of word occurrences in each cluster.
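The PLSA-smoothed term representation and the Hellinger similarity described above can be sketched as follows. This assumes Equation 7 is the usual PLSA mixture P(w|d) = Σ_z P(w|z)P(z|d), and uses the Hellinger/Bhattacharyya affinity Σ_w √(p_w·q_w) as the similarity; both are illustrative assumptions, not the authors' exact code:

```python
import math

def plsa_smoothed_term_dist(p_w_given_z, p_z_given_d):
    """PLSA-smoothed P(w|d) = sum_z P(w|z) P(z|d).

    p_w_given_z: p_w_given_z[w][z], each column a distribution over terms.
    p_z_given_d: P(z|d) for one document (a distribution over classes).
    """
    return [sum(p_wz[z] * p_z_given_d[z] for z in range(len(p_z_given_d)))
            for p_wz in p_w_given_z]

def hellinger_similarity(p, q):
    """Hellinger affinity sum_w sqrt(p_w * q_w): 1 when p == q,
    0 when the two distributions have disjoint support."""
    return sum(math.sqrt(pw * qw) for pw, qw in zip(p, q))
```

Documents would then be ranked for a query by hellinger_similarity between the smoothed P(w|d) and P(w|q).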
We iterated over the number of latent classes starting from 10 classes up to 540 classes in increments of 10 classes.</Paragraph> <Paragraph position="3"> We evaluated the retrieval results (at the 11 standard recall levels, as well as the average precision and break-even precision) using manually tagged relevance. Figure 1 shows the average precision as a function of the number of latent classes for the CACM collection, the largest of the datasets.</Paragraph> <Paragraph position="4"> The LSA-PLSA model performed better than both the LSA model and the PLSA model at all class sizes. The same general trend was observed for the CISI dataset. For the two smallest datasets, the LSA-PLSA model performed better than the randomly-initialized PLSA model at all class sizes; it performed better than the LSA model at the larger class sizes, where the best performance is obtained.</Paragraph> <Paragraph position="5"> Performance at the optimal number of latent classes is shown in Table 2. The results show that LSA-PLSA outperforms LSA on 7 out of 8 evaluations. LSA-PLSA outperforms both random and k-means initialization of PLSA in all evaluations. In addition, performance using random initialization was never worse than k-means initialization, which is itself sensitive to initialization values. Thus, in the rest of our experiments, we initialized PLSA models using the simpler random initialization instead of k-means initialization. We explored the use of an LSA-PLSA model when averaging the similarity scores from multiple models for ranking in retrieval. We compared a baseline of 4 randomly-initialized PLSA models against 2 averaged models that contain an LSA-PLSA model: 1) 1 LSA, 1 PLSA, and 1 LSA-PLSA model, and 2) 1 LSA-PLSA with 3 PLSA models. We also compared these models against the performance of an averaged model without an LSA-PLSA model: 1 LSA and 1 PLSA model. In each case, the PLSA models were randomly initialized.
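The evaluation at the 11 standard recall levels mentioned above is the classic interpolated-precision measure from TREC-style retrieval evaluation. A minimal sketch (standard definition, not taken from the paper) is:

```python
def eleven_point_average_precision(ranked_relevance, n_relevant):
    """Average interpolated precision at the 11 standard recall
    levels 0.0, 0.1, ..., 1.0.

    ranked_relevance: 0/1 relevance flags for the ranked documents.
    n_relevant: total relevant documents for the query.
    Interpolated precision at recall r is the maximum precision
    observed at any rank whose recall is >= r (0 if none).
    """
    hits = 0
    points = []                      # (recall, precision) down the ranking
    for rank, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / n_relevant, hits / rank))
    interp = []
    for i in range(11):
        r = i / 10
        precisions = [p for rec, p in points if rec >= r]
        interp.append(max(precisions) if precisions else 0.0)
    return sum(interp) / 11
```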
Figure 2 shows the average precision as a function of the number of latent classes for the CISI collection using multiple models. At all class sizes, a combined model that included the LSA-initialized PLSA model performed at least as well as using 4 PLSA models. This was also true for the CRAN dataset. For the other two datasets, the performance of the combined model was always better than the performance of 4 PLSA models when the number of factors was no more than 200-300, the region where the best performance was observed.</Paragraph> <Paragraph position="6"> Table 3 summarizes the results and gives the best-performing model for each task. Comparing Tables 2 and 3, note that the use of multiple models improved retrieval results. Table 3 also indicates that combining 1 LSA, 1 PLSA, and 1 LSA-PLSA model outperformed the combination of 4 PLSA models in 7 out of 8 evaluations.</Paragraph> <Paragraph position="7"> For our data, the time to compute the LSA model is approximately 60% of the time to compute a PLSA model. The &quot;LSA PLSA LSA-PLSA&quot; combination requires computing 1 LSA and 2 PLSA models, in contrast to 4 models for the 4-PLSA combination, and therefore requires less than 75% of the running time of the 4-PLSA model.</Paragraph> </Section> <Section position="3" start_page="110" end_page="111" type="sub_section"> <SectionTitle> 6.3 Text Segmentation </SectionTitle> <Paragraph position="0"> A number of researchers (e.g., Li and Yamanishi (2000); Hearst (1997)) have developed text segmentation systems. Brants et al. (2002) developed a system for text segmentation based on a PLSA model of similarity. The text is divided into overlapping blocks of sentences and the PLSA representation of the terms in each block, P(w|b), is computed. The similarity between pairs of adjacent blocks is then computed using P(w|b) and the Hellinger similarity measure.
The positions of the largest local minima, or dips, in the sequence of block-pair similarity values are emitted as segmentation points.</Paragraph> <Paragraph position="1"> We compared the use of different initializations on 500 documents created from Reuters-21578, in a manner similar to Li and Yamanishi (2000).</Paragraph> <Paragraph position="2"> The performance is measured using the error probability at the word and sentence level (Beeferman et al., 1997), P_w and P_s respectively. This measure allows for close matches in segment boundaries. Specifically, the boundaries must be within k words/sentences, where k is set to half the average segment length in the test data. In order to account for the random initial values of the PLSA models, we performed the whole set of experiments for each parameter setting four times and averaged the results.</Paragraph> <Paragraph position="3"> We compared the segmentation performance using an LSA-PLSA model against the randomly-initialized PLSA models used by Brants et al. (2002). Table 4 presents the performance over different class sizes for the two models. Comparing performance at the optimum class size for each model, the results in Table 4 show that the LSA-PLSA model outperforms PLSA on both word and sentence error rate.</Paragraph> <Paragraph position="4"> We explored the use of an LSA-PLSA model when averaging multiple PLSA models to reduce the effect of poor model initialization. In particular, the adjacent-block similarity from multiple models was averaged and used in the dip computations. For simplicity, we fixed the class size of the individual models to be the same for a particular combined model and then computed performance over a range of class sizes. We compared a baseline of four randomly initialized PLSA models against two averaged models that contain an LSA-PLSA model: 1) one LSA-PLSA with two PLSA models and 2) one LSA-PLSA with three PLSA models.
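The averaged-similarity dip selection described in this subsection can be sketched as follows. The precise dip-depth definition is not given in the text, so the version below (how far a local minimum falls below its lower neighboring value) is an assumption:

```python
def averaged_dip_segmentation(model_similarities, n_boundaries):
    """Average adjacent-block similarity sequences from several
    models, then emit the deepest local minima (dips) as
    segmentation points.

    model_similarities: one sequence per model, aligned by block
    pair, where seq[i] = similarity(block i, block i+1).
    Returns the positions of the n_boundaries deepest dips, sorted.
    """
    n_models = len(model_similarities)
    sims = [sum(seq[i] for seq in model_similarities) / n_models
            for i in range(len(model_similarities[0]))]
    # strict local minima of the averaged similarity curve
    dips = [i for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]]
    # assumed depth measure: drop below the lower neighboring value
    dips.sort(key=lambda i: min(sims[i - 1], sims[i + 1]) - sims[i],
              reverse=True)
    return sorted(dips[:n_boundaries])
```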
The best results were achieved using a combination of PLSA and LSA-PLSA models (see Table 5). All multiple-model combinations performed better than a single model (compare Tables 4 and 5), as expected.</Paragraph> <Paragraph position="5"> In terms of computational cost, it is less costly to compute one LSA-PLSA model and two PLSA models than to compute four PLSA models. In addition, the LSA-initialized models tend to perform best with a smaller number of latent variables than the four-PLSA combination needs, further reducing the computational cost.</Paragraph> </Section> </Section> </Paper>