<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1103">
  <Title>An Approach for Combining Content-based and Collaborative Filters</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Overview of our approach
</SectionTitle>
    <Paragraph position="0"> In this paper, we suggest a technique that introduces the contents of items into item-based collaborative filtering to improve its prediction quality and solve the cold start problem. For short, we call the technique ICHM (Item-based Clustering Hybrid Method).</Paragraph>
    <Paragraph position="1"> In ICHM, we integrate the item information and user ratings to calculate the item-item similarity.</Paragraph>
    <Paragraph position="2"> Figure 1 shows this procedure. The detailed procedure of our approach is as follows: * Apply a clustering algorithm to group the items, then use the result, represented as a fuzzy set, to create a group-rating matrix.</Paragraph>
    <Paragraph position="3"> * Compute the similarity: first, calculate the similarity of the group-rating matrix using the adjusted cosine algorithm; then calculate the similarity of the item-rating matrix using the Pearson correlation-based algorithm. Finally, the total similarity is a linear combination of the two.</Paragraph>
    <Paragraph position="4"> * Make a prediction for an item by performing a weighted average of deviations from the neighbour's mean.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Group rating
</SectionTitle>
      <Paragraph position="0"> The goal of group rating is to group the items into several cliques and provide content-based information for the collaborative similarity calculation.</Paragraph>
      <Paragraph position="1"> Each item has its own attribute features; a movie item, for example, may have actor, actress, director, genre, and synopsis as its attribute features. Thus, we can group the items based on these attributes.</Paragraph>
      <Paragraph position="2"> The algorithm applied for grouping ratings is derived from the K-means clustering algorithm (Han and Kamber, 2000). The difference is that we apply fuzzy set theory to represent the affiliation between an object and a cluster. As shown in Figure 2, items are first grouped into a given number of clusters. After grouping is complete, the probability of one object j (here, one object means one item) being assigned to a certain cluster is calculated as follows.</Paragraph>
      <Paragraph position="4"> Pr(o_j, k) = 1 - CS(j, k) / Max_i CS(i, k)    (1)  where Pr(o_j, k) means the probability of object j being assigned to cluster k; CS(j, k) means the function that calculates the counter-similarity between object j and cluster k; and Max_i CS(i, k) means the maximum counter-similarity between any object and cluster k.</Paragraph>
      <Paragraph position="5"> Input: the number of clusters k and the item attributes. Output: a set of k clusters that minimizes the squared-error criterion, and the probability of each item belonging to each cluster center, represented as a fuzzy set.</Paragraph>
      <Paragraph position="6"> (1) Arbitrarily choose k objects as the initial cluster centers. (2) Repeat (a) and (b) until no change: (a) (re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster; (b) update the cluster means, i.e., recalculate the mean value of the objects in each cluster. (3) Compute the probability between each object and each cluster center. The counter-similarity CS(j, k) can be calculated by Euclidean distance or the cosine method.</Paragraph>
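      <Paragraph>
The membership step above can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: we use Euclidean distance as the counter-similarity, take the cluster centers as already found by K-means, and the names `euclidean` and `fuzzy_membership` are ours.

```python
import math

def euclidean(a, b):
    # Counter-similarity CS(j, k): larger means object j is farther from center k.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fuzzy_membership(items, centers):
    # Equation 1: Pr(o_j, k) = 1 - CS(j, k) / Max_i CS(i, k)
    memberships = []
    for center in centers:
        dists = [euclidean(item, center) for item in items]
        max_d = max(dists) or 1.0  # guard against all-zero distances
        memberships.append([1 - d / max_d for d in dists])
    return memberships  # memberships[k][j] = Pr(object j belongs to cluster k)

# Three 2-D items and two cluster centers (assumed given by K-means).
items = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
centers = [(1.0, 0.0), (0.0, 1.0)]
m = fuzzy_membership(items, centers)
```

An item coinciding with a center gets membership 1 for that cluster, and the farthest item gets 0, matching the high/low percentages in the clustering example of this paper.
      </Paragraph>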
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Similarity computation
</SectionTitle>
      <Paragraph position="0"> As we can see, after grouping the items, we get a new rating matrix. We can use the item-based collaborative algorithm to calculate the similarity and make the predictions for users.</Paragraph>
      <Paragraph position="1"> There are many ways to compute the similarity.</Paragraph>
      <Paragraph position="2"> In our approach, we use two of them, and make a linear combination of their results.</Paragraph>
      <Paragraph position="3">  The most common measure for calculating the similarity is the Pearson correlation algorithm. Pearson correlation measures the degree to which a linear relationship exists between two variables. The Pearson correlation coefficient is derived from a linear regression model, which relies on a set of assumptions regarding the data, namely that the relationship must be linear, and the errors must be independent and have a probability distribution with mean 0 and constant variance for every setting of the independent variable (McClave and Dietrich, 1998).</Paragraph>
      <Paragraph position="4"> sim(k, l) = Sum_u (R_{u,k} - R̄_k)(R_{u,l} - R̄_l) / [sqrt(Sum_u (R_{u,k} - R̄_k)^2) * sqrt(Sum_u (R_{u,l} - R̄_l)^2)]    (2)  where sim(k, l) means the similarity between items k and l; the sums run over the m users who rated both items k and l; R_{u,k} and R_{u,l} mean the ratings of user u on items k and l, respectively; and R̄_k and R̄_l are the average ratings of items k and l.</Paragraph>
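      <Paragraph>
Equation 2 can be sketched in a few lines. This is our illustrative version, assuming ratings are stored as a user-to-{item: rating} mapping and taking item averages over the co-rating users; the name `pearson_item_sim` is ours.

```python
import math

def pearson_item_sim(ratings, k, l):
    # Equation 2: Pearson correlation between items k and l over the
    # users who rated both; item means are taken over those co-raters.
    common = [u for u in ratings if k in ratings[u] and l in ratings[u]]
    if not common:
        return 0.0
    avg_k = sum(ratings[u][k] for u in common) / len(common)
    avg_l = sum(ratings[u][l] for u in common) / len(common)
    num = sum((ratings[u][k] - avg_k) * (ratings[u][l] - avg_l) for u in common)
    den = (math.sqrt(sum((ratings[u][k] - avg_k) ** 2 for u in common))
           * math.sqrt(sum((ratings[u][l] - avg_l) ** 2 for u in common)))
    return num / den if den else 0.0

# Two items whose co-ratings are perfectly linearly related: sim is 1.
ratings = {"u1": {"a": 1, "b": 2}, "u2": {"a": 2, "b": 4}, "u3": {"a": 3, "b": 6}}
sim_ab = pearson_item_sim(ratings, "a", "b")
```
      </Paragraph>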
      <Paragraph position="7">  Cosine similarity has been used to calculate the similarity between users, but it has one shortcoming: a difference in rating scale between users results in a quite different similarity. For instance, suppose Bob rates only 4 on the best movies, never 5, and rates 1 on bad movies instead of the standard score of 2, while Oliver always rates on the standard scale, giving 5 to the best movies and 2 to bad ones. Under traditional cosine similarity, the two users appear quite different. The adjusted cosine similarity (Sarwar et al., 2001) was proposed to offset this drawback.</Paragraph>
      <Paragraph position="8"> sim(k, l) = Sum_u (R_{u,k} - R̄_u)(R_{u,l} - R̄_u) / [sqrt(Sum_u (R_{u,k} - R̄_u)^2) * sqrt(Sum_u (R_{u,l} - R̄_u)^2)]    (3)  where R̄_u is the average rating of user u, and R_{u,k} and R_{u,l} mean the rating of user u on items k and l, respectively.</Paragraph>
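      <Paragraph>
A sketch of Equation 3, again assuming ratings as a user-to-{item: rating} mapping (the name `adjusted_cosine_sim` is ours). Each rating is offset by that user's own average, which is what resolves the Bob/Oliver scale mismatch.

```python
import math

def adjusted_cosine_sim(ratings, k, l):
    # Equation 3: like cosine, but each rating is offset by that user's
    # own average, so different personal rating scales line up.
    common = [u for u in ratings if k in ratings[u] and l in ratings[u]]
    num = den_k = den_l = 0.0
    for u in common:
        avg_u = sum(ratings[u].values()) / len(ratings[u])
        dk = ratings[u][k] - avg_u
        dl = ratings[u][l] - avg_u
        num += dk * dl
        den_k += dk ** 2
        den_l += dl ** 2
    den = math.sqrt(den_k) * math.sqrt(den_l)
    return num / den if den else 0.0

# Bob rates 4/1, Oliver rates 5/2: after mean-offsetting, both users say
# the two items are exact opposites, regardless of personal scale.
ratings = {"Bob": {"best": 4, "bad": 1}, "Oliver": {"best": 5, "bad": 2}}
s = adjusted_cosine_sim(ratings, "best", "bad")
```
      </Paragraph>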
      <Paragraph position="9">  Due to the difference in value range between the item-rating matrix and the group-rating matrix, we use different methods to calculate the similarity. For the item-rating matrix, the rating value is an integer; for the group-rating matrix, it is a real value ranging from 0 to 1. One natural way is to enlarge the continuous data range from [0, 1] to [1, 5] (or reduce the discrete data range from [1, 5] to [0, 1]) and then apply the Pearson correlation-based algorithm or the adjusted cosine algorithm to calculate the similarity; we call this enlarged ICHM. We also propose another method: first, use the Pearson correlation-based algorithm to calculate the similarity from the item-rating matrix; then calculate the similarity from the group-rating matrix with the adjusted cosine algorithm; finally, take the total similarity as the linear combination of the two. We call this combination ICHM.</Paragraph>
      <Paragraph position="11"> sim(k, l) = sim_item(k, l) x (1 - c) + sim_group(k, l) x c    (4)  where sim(k, l) means the total similarity between items k and l; c is the combination coefficient; sim_item(k, l) means the similarity between items k and l calculated from the item-rating matrix; and sim_group(k, l) means the similarity between items k and l calculated from the group-rating matrix.</Paragraph>
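      <Paragraph>
The linear combination of Equation 4 is a one-liner; a minimal sketch (the function name `combined_sim` is ours), reproducing the paper's Gone with the Wind / Swordfish numbers with c = 0.4:

```python
def combined_sim(sim_item, sim_group, c=0.4):
    # Equation 4: total similarity as a linear combination of the
    # item-rating similarity and the group-rating similarity.
    return sim_item * (1 - c) + sim_group * c

# sim(G,S) = 1 x (1 - 0.4) + 0.9999 x 0.4
total = combined_sim(1.0, 0.9999)
```
      </Paragraph>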
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Collaborative prediction
</SectionTitle>
      <Paragraph position="0"> The prediction for an item is then computed by performing a weighted average of deviations from the neighbours' means. Here we use the top-N rule to select the nearest N neighbours based on the similarities of items. The general formula for a prediction on item k for user u (Resnick et al., 1994) is:  P_{u,k} = R̄_k + Sum_{i=1..N} sim(k, i) x (R_{u,i} - R̄_i) / Sum_{i=1..N} |sim(k, i)|    (5)  where P_{u,k} is the prediction for user u on item k; R̄_k means the average of all ratings on item k; sim(k, i) means the similarity between item k and its neighbour i; R_{u,i} is the rating of user u on neighbour i; and R̄_i means the average of all ratings on item i.</Paragraph>
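      <Paragraph>
The prediction step can be sketched as follows; this is our illustration under assumed data structures (dictionaries for the user's ratings, item averages, and similarities to item k), not the paper's code.

```python
def predict(user_ratings, item_avg, sims, k, n=30):
    # Equation 5: item k's average plus a weighted average of the user's
    # deviations from each neighbour's mean, over the top-N neighbours.
    #   user_ratings: {item: rating} for the target user
    #   item_avg:     {item: average rating of that item over all users}
    #   sims:         {item: sim(k, item)}
    neighbours = sorted((i for i in user_ratings if i in sims),
                        key=lambda i: sims[i], reverse=True)[:n]
    num = sum(sims[i] * (user_ratings[i] - item_avg[i]) for i in neighbours)
    den = sum(abs(sims[i]) for i in neighbours)
    return item_avg[k] + (num / den if den else 0.0)

# Toy example: item k averages 3.0; the user rated neighbour "a" one
# point above its mean, so the prediction is pulled above 3.0.
p = predict({"a": 5.0, "b": 3.0},
            {"k": 3.0, "a": 4.0, "b": 3.0},
            {"a": 1.0, "b": 0.5}, "k")
```
      </Paragraph>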
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Cold start problem
</SectionTitle>
      <Paragraph position="0"> In the traditional collaborative filtering approach, it is hard for pure collaborative filtering to recommend a new item to a user, since no user has made any rating on the new item. In our approach, however, we can make predictions for the new item based on the information from the group-rating matrix. Our experiment shows good recommendation performance for new items. In Equation 5, R̄_k is the average of all ratings on item k. For a new item, no user has rated it, so R̄_k would be zero. Since R̄_k is the standard baseline of user ratings and it is zero, it is unreasonable to apply Equation 5 to a new item directly. Therefore, for a new item, we instead use the average rating of the new item's nearest neighbours as the baseline.</Paragraph>
      <Paragraph position="4"> The following is a procedure of our approach. * Based on the item contents, such as movie genre, director, actor, actress, and even synopsis, we apply a clustering algorithm to group the items, using a fuzzy set to represent the clustering result. Assume the result is as follows: Cluster 1: {Gone with the Wind (98%), Swordfish (100%), Pearl Harbour (1.0%), Hero (95%), The Sound of Music (0.12%)}; Cluster 2: {Gone with the Wind (0.13%), Swordfish (0.02%), Pearl Harbour (95%), Hero (1.2%), The Sound of Music (98%)}. The number in parentheses following each movie name is the probability of that movie belonging to the cluster. * We use the group-rating engine to make a group-rating matrix, as Table 2 shows.</Paragraph>
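      <Paragraph>
The cold-start rule described above (replacing the unavailable item average R̄_k with the mean of the nearest neighbours' averages) can be sketched as follows; the interface and the name `predict_new_item` are our assumptions, not the paper's code.

```python
def predict_new_item(user_ratings, item_avg, sims, n=30):
    # A brand-new item has no ratings, so its average does not exist;
    # use the mean of the top-N neighbours' averages as the baseline.
    # The neighbours come from the group-rating similarity, which needs
    # no ratings on the new item itself.
    neighbours = sorted(sims, key=sims.get, reverse=True)[:n]
    baseline = sum(item_avg[i] for i in neighbours) / len(neighbours)
    rated = [i for i in neighbours if i in user_ratings]
    num = sum(sims[i] * (user_ratings[i] - item_avg[i]) for i in rated)
    den = sum(abs(sims[i]) for i in rated)
    return baseline + (num / den if den else 0.0)

# Toy example: neighbour averages are 4.0 and 2.0 (baseline 3.0), and the
# user rated each neighbour one point above its mean.
p_new = predict_new_item({"a": 5.0, "b": 3.0},
                         {"a": 4.0, "b": 2.0},
                         {"a": 1.0, "b": 1.0})
```
      </Paragraph>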
      <Paragraph position="5"> Then we combine the group-rating matrix and the item-rating matrix to form a new rating matrix. * Now, we can calculate the similarity between items based on this new unified rating data matrix. The similarity between items consists of two parts. The first part calculates the similarity based on user ratings, using the Pearson correlation-based algorithm. The second part calculates the similarity based on the clustering result, using the adjusted cosine algorithm. The total similarity between items is the linear combination of the two. For example, consider calculating the similarity between Gone with the Wind and Swordfish.</Paragraph>
      <Paragraph position="7"> First, the two parts of the similarity are calculated from the item-rating matrix and the group-rating matrix. Secondly, sim(G,S) is calculated based on Formula 4; here the combination coefficient is 0.4.</Paragraph>
      <Paragraph position="8"> sim(G,S) = 1 x (1 - 0.4) + 0.9999 x 0.4 = 0.9999. * Then, predictions for items are calculated by performing a weighted average of deviations from the neighbours' means.</Paragraph>
      <Paragraph position="9"> In this example, the item The Sound of Music, on which no one has made any rating, can be treated as a new item. The traditional item-based collaborative method, which makes predictions only from the item-rating matrix (Table 1), cannot make predictions for this item. In our approach, however, we can make predictions for users based on the group ratings (Table 2).</Paragraph>
      <Paragraph position="10"> From the description of our approach, we can observe that it fully realizes the strengths of content-based filtering, mitigating the effects of the new item problem. In addition, when calculating the similarity, our approach considers information not only from personal tastes but also from the item contents, which provides a latent ability for better prediction and enables serendipitous recommendations.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.6 UCHM
</SectionTitle>
      <Paragraph position="0"> The clustering technique can be applied not only to item-based collaborative recommenders but also to user-based collaborative recommenders. For short, we call the latter UCHM (User-based Clustering Hybrid Method). In UCHM, clustering is based on the attributes of user profiles and the clustering result is treated as items; in ICHM, clustering is based on the attributes of items and the clustering result is treated as users, as Figure 3 shows.</Paragraph>
      <Paragraph position="1"> In Combination UCHM, we apply Equation 2 to calculate the similarity in user-rating matrix, and</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
[Figure 3: UCHM combines a user-rating matrix with a group-rating matrix; ICHM combines an item-rating matrix with a group-rating matrix]
</SectionTitle>
      <Paragraph position="0"> Equation 3 to calculate the similarity in the group-rating matrix, and then make a linear combination of the two. When we apply Equations 2 and 3 to UCHM, k and l mean users and u means an item, instead of their original meanings.</Paragraph>
      <Paragraph position="1"> As for UCHM, clustering is based on the user profiles. User profiles indicate the information needs or preferences on items that users are interested in. A user profile can consist of several profile vectors and each profile vector represents an aspect of his preferences, such as movie genre, director, actor, actress and synopsis. The profile vectors are automatically constructed from rating data by the following simple equation.</Paragraph>
      <Paragraph position="2"> A = m / n  where n is the number of items whose rating value is larger than a given threshold, and m is the number of those n items that contain attribute A. In our experiment, we set the value of the threshold to 3.</Paragraph>
      <Paragraph position="3"> For example, in Section 3.5, Tom makes ratings on four movies, and three of them are rated above the threshold of 3. From the genre information, we know that Gone with the Wind belongs to the love genre, while Swordfish and Hero belong to the action genre. So Tom's profile is as follows: Tom {love (1/3), action (2/3)}.</Paragraph>
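      <Paragraph>
The profile construction can be sketched as follows. This is a minimal illustration: Tom's concrete ratings and Pearl Harbour's genre are our assumptions filled in to match the stated result, and the name `user_profile` is ours.

```python
def user_profile(user_ratings, item_genres, threshold=3):
    # Profile vector A = m / n: n items rated above the threshold,
    # m of which carry attribute A (here, a genre).
    liked = [i for i, r in user_ratings.items() if r > threshold]
    counts = {}
    for item in liked:
        for genre in item_genres[item]:
            counts[genre] = counts.get(genre, 0) + 1
    return {g: m / len(liked) for g, m in counts.items()} if liked else {}

# Tom rates four movies; three exceed the threshold 3 (Pearl Harbour's
# rating and genre are assumed for illustration).
tom = {"Gone with the Wind": 5, "Swordfish": 4, "Hero": 4, "Pearl Harbour": 2}
genres = {"Gone with the Wind": ["love"], "Swordfish": ["action"],
          "Hero": ["action"], "Pearl Harbour": ["love"]}
profile = user_profile(tom, genres)
```

This reproduces Tom {love (1/3), action (2/3)} from the text.
      </Paragraph>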
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental evaluations
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Data set
</SectionTitle>
      <Paragraph position="0"> Currently, we perform experiments on a subset of movie rating data collected from the MovieLens web-based recommender. The data set contains 100,000 ratings from 943 users on 1,682 movies, with each user rating at least 20 items. We divide the data set into a training set and a test set.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Evaluation metrics
</SectionTitle>
      <Paragraph position="0"> MAE (Mean Absolute Error) has been widely used to evaluate the accuracy of a recommender system by comparing the numerical recommendation scores against the actual user ratings in the test data. The MAE is calculated by summing the absolute errors of the corresponding rating-prediction pairs and then computing the average:  MAE = Sum_{i=1..n} |P_{u,i} - R_{u,i}| / n  where P_{u,i} means the prediction for user u on item i; R_{u,i} means the rating of user u on item i in the test data; and n is the number of rating-prediction pairs between the test data and the prediction result. The lower the MAE, the more accurate the prediction.</Paragraph>
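      <Paragraph>
The metric is a short computation; a minimal sketch (the name `mae` is ours) over parallel lists of predictions and actual test ratings:

```python
def mae(predictions, actuals):
    # MAE = (1/n) * sum(|P_ui - R_ui|) over the n rating-prediction pairs.
    assert len(predictions) == len(actuals) and predictions
    return sum(abs(p - r) for p, r in zip(predictions, actuals)) / len(predictions)

# Two errors of 1 over three pairs: MAE = 2/3.
score = mae([4.0, 3.0, 5.0], [5.0, 3.0, 4.0])
```
      </Paragraph>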
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Behaviours of our method
</SectionTitle>
      <Paragraph position="0"> We implement the group-rating method described in Section 3.1 and test it on the MovieLens data with different numbers of clusters. Figure 4 shows the experimental results. It can be observed that the number of clusters does affect the quality of prediction, in both UCHM and ICHM.</Paragraph>
      <Paragraph position="1">  In order to find the optimal combination coefficient c in Equation 4, we conducted a series of experiments, changing the combination coefficient from 0 to 1 with a constant step of 0.1. Figure 5 shows that when the coefficient reaches 0.4, optimal recommendation performance is achieved.</Paragraph>
      <Paragraph position="2">  As described in Section 3.1, our grouping-ratings method needs to calculate the similarity between objects and clusters, so we try two methods: Euclidean distance and the cosine angle. It can be observed in Figure 6 that the cosine angle method tends to perform better than the Euclidean distance method, but the difference is negligible.</Paragraph>
      <Paragraph position="3">  From Figure 7, it can be observed that combination ICHM performs best, followed by enlarged ICHM and then the item-based collaborative method; last is UCHM (User-based Clustering Hybrid Method), which applies the clustering technique described in Section 3 to user-based collaborative filtering, clustering user profiles instead of item contents.</Paragraph>
      <Paragraph position="4"> We can also observe that the size of the neighbourhood affects the quality of prediction (Herlocker et al., 1999). The performance improves as we increase the neighbourhood size from 10 to 30, then tends to be flat.</Paragraph>
      <Paragraph position="5">  As for the cold start problem, we choose items from the training data set and delete all the ratings of those items, so that we can treat them as new items. First, we randomly selected item No. 946. In the test data, user No. 946 has 11 ratings, which are shown as the real-value bars in Figure 8. We can observe that the prediction for a new item can partially reflect the user preference. To generalize the observation, we randomly select from 10 to 50 items, in steps of 10, and also 100 items from the test data, delete all the ratings of those items, and treat them as new items. Table 3 shows that ICHM can solve the cold start problem. When we apply the clustering method to movie items, we use the item attribute of movie genre. However, our approach can consider more dimensions of item attributes, such as actor, actress, and director, and even the synopsis. In order to observe the effect of higher-dimensional item attributes, we collected 100 movie synopses from the Internet Movie Database (http://www.imdb.com) to provide attribute information for clustering movies. Our experiment shows that correct attributes of movies can further improve the performance of the recommender system, as Figure 9 shows.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Our method versus the classic one
</SectionTitle>
      <Paragraph position="0"> Although some hybrid recommender systems already exist, it is hard to make an evaluation among them. Some systems (Delgado et al., 1998) use Boolean values (relevant or irrelevant) to represent user preferences, while others use numeric values, so the same evaluation metric cannot make a fair comparison. Furthermore, the quality of some systems depends on time, since their system parameters change with user feedback (Claypool et al., 1999), and Claypool does not clearly describe how the weight changes as time passes.</Paragraph>
      <Paragraph position="1"> However, we can make a simple conceptual comparison. In the Fab system, the similarity for prediction is based only on the user profiles. As for UCHM, which clusters the content information of user profiles and uses a user-based collaborative algorithm rather than the item-based one of ICHM, the impact of the combination coefficient can be observed in Figure 5. In UCHM, when the coefficient equals 1, it describes the condition that Fab applies: the similarity between users is calculated only from the group-rating matrix. In that condition, the MAE shows the worst recommendation quality.</Paragraph>
    </Section>
  </Section>
</Paper>