Abstract

Recommender systems are widespread due to their ability to help Web users surf the Internet in a personalized way. For example, the collaborative recommender system is a powerful Web personalization tool for suggesting useful items to a given user based on opinions collected from his neighbors. Among many factors, the similarity measure is an important one affecting the performance of the collaborative recommender system. However, the similarity measure itself largely depends on the overlapping between user profiles, and most previous systems are tested with a predefined number of common items and neighbors, although the system performance may vary if these parameters change. The main aim of this paper is to examine the performance of the collaborative recommender system under many similarity measures, common set cardinalities, rating mean groups, and neighborhood set sizes. For this purpose, we propose a modified version of the mean difference weights similarity measure and a new evaluation metric, called users’ coverage, for measuring the ability of the recommender system to help users. The experimental results show that the modified mean difference weights similarity measure outperforms the other similarity measures and that the performance of the collaborative recommender system varies with its parameters; hence, these parameters must be specified in advance.

1. Introduction

Today, Web users face an abundance of choices when they surf the Web. Hence, recommender systems (RSs), as Web personalization tools, become necessary to offer Web users personalized items they may like. These systems are now available in many Web sites covering social networks, e-commerce, e-business, e-tourism, and many other domains [1, 2]. A recent survey of the applications of recommender systems and their classifications is given in [2].

Basically, an RS compares users based on a suitable similarity measure, which plays an important role in the success of the whole system. However, different similarity measures often lead to different sets of neighbors for a given active user; a good similarity measure will produce a close set of neighbors for a given active user [3]. Actually, many of the existing similarity measures for collaborative recommender systems rely on the overlapping between users. Nevertheless, the size of this overlapping has not been explored in detail, since most of the previous work studied similarity measures based on a predefined number of common items [3–8].

This motivates us to study the effect of the common set cardinality on the performance of different similarity measures for collaborative recommender systems. The proximity between two users based upon a single commonly rated item is surely weaker than that based on 20 commonly rated items. Moreover, the second case is more reliable because close sets of neighbors are guaranteed. This paper studies the effect of three parameters, namely, the cardinality of the commonly rated items, the rating mean group, and the number of neighbors, on the performance of the collaborative recommender system. The contributions of this paper are threefold:
(1) The notion of users’ coverage is introduced, as opposed to items’ coverage.
(2) We propose a modified version of the mean difference weights similarity measure.
(3) Our experiments are implemented on both synthetic and real datasets.

The rest of this paper is organized as follows: a literature review is given in Section 2 and an introduction to collaborative recommender systems is given in Section 3. The effect of the cardinality of the common set on the performance of different similarity measures is introduced in Section 4. Section 5 presents the experimental methodology used for examining many similarity measures while Section 6 discusses the results of the conducted experiments. Finally, we conclude our work in Section 7.

2. Literature Review

Many papers have discussed and proposed similarity measures, but they fixed the lowest number of common items in advance and examined their proposals based on that predefined number [3–9]. For example, Al-Shamri [3] examined traditional approaches and proposed a power coefficient as a similarity measure, but he assumed that the common set size is greater than or equal to five. Breese et al. [9] performed an empirical analysis of many similarity measures used for collaborative filtering and recognized the effect of a small common set. Therefore, they suggested default voting for unrated items to enhance the system performance. The same approach is used by many authors to overcome the low number of common items [10]. This is in agreement with the findings of [11], which showed that participants of their system were more confident in their choices when the recommender had a high rating overlap with the decision maker. However, the default voting approach may not reflect the actual user taste for unrated items.

Usually, active users correlate highly with neighbors having a very small number of corated items. These neighbors are terrible predictors because the correlation is based on tiny samples of common items. The authors of [12, 13] devalue similarity weights that are based on a small number of corated items by applying a correlation significance-weighting factor of $n/50$ to the original weight, where $n$ is the number of corated items. Thus, a full contribution is given only to those users having at least 50 common items with the active user. However, in many cases it is impossible to find this many common items.

Vozalis and Margaritis [14] tested the Pearson correlation coefficient (PCC) with a fixed number of common items of 25. Later, they tested the same system under several threshold values of common items, namely, 1, 10, 20, 30, and 40, and assumed the best common-item threshold to be 20. However, they fixed the number of neighbors and did not consider the user rating mean group in their work. Moreover, all the experiments they carried out tested only one similarity measure, PCC. Our paper studies the system performance under a varying similarity measure and three other parameters, namely, the cardinality of the common set, the user rating mean group, and the number of neighbors for the active user. These parameters are tested using synthetic and real datasets. Five choices of common items are tested; the first choice assumes at least one common item, and the remaining choices assume at least 5, 10, 15, and 20 common items.

3. Collaborative Recommender System

Many types of recommender systems have been proposed based on the way they build user models and their working principle [1]. Systematically, any recommender system passes through five phases to perform its job, namely, data collection, profile formulation, similarity computation, neighborhood set selection, and predictions and recommendations. The effect of each phase on the RS performance depends on its position in the RS stack; the early phases affect the RS more because the performance of the later stages depends on them. The most popular RS is the collaborative recommender system (CRS), which relies on the opinions of possibly similar users and hence allows the system to recommend out-of-the-box items [4, 5].

Formally, a CRS has $M$ users, $U = \{u_1, u_2, \dots, u_M\}$, rating explicitly or implicitly $N$ items, $I = \{i_1, i_2, \dots, i_N\}$, such as news, Web pages, books, movies, or CDs. Each user $u_x$ has rated a subset of items $I_x \subseteq I$. The declared rating of user $u_x$ for an item $i$ is denoted by $r_{x,i}$, and the user’s average rating is denoted by $\bar{r}_x$ [4].

During the similarity computation phase, the RS matches the active user to the available database of training users according to a suitable similarity measure. The resulting value measures how closely two users resemble each other. Once similarity values are computed, the system ranks users according to their similarity with the active user to extract a set of neighbors for him. After that, the CRS assigns a predicted rating to all the items seen by the neighborhood set but not by the active user. The predicted rating, $p_{a,i}$, indicates the expected interestingness of item $i$ to the active user $u_a$ [3, 8].

4. Common Set Cardinality and the Similarity Measure

Similarity computation is the third phase in building a recommender system. Obviously, the accuracy and reliability of this phase rely largely on the two phases preceding it. This paper concentrates on the similarity computation phase and assumes that all remaining phases are fixed and accurate, except for changing the number of neighbors in some experiments. For similarity computation, many similarity measures are used in the literature, and this paper examines only three of them. The first similarity measure is the Pearson correlation coefficient (PCC) [3–5, 14], which is the most popular similarity function for memory-based CRS:

$$\mathrm{PCC}(u_x, u_y) = \frac{\sum_{i \in I_{x,y}} (r_{x,i} - \bar{r}_x)(r_{y,i} - \bar{r}_y)}{\sqrt{\sum_{i \in I_{x,y}} (r_{x,i} - \bar{r}_x)^2}\,\sqrt{\sum_{i \in I_{x,y}} (r_{y,i} - \bar{r}_y)^2}}. \quad (1)$$

PCC computes the similarity between two users, $u_x$ and $u_y$, based on the common ratings both users have declared, that is, the ratings over the common set $I_{x,y} = I_x \cap I_y$, where $|I_{x,y}|$ is the cardinality of the common set. Logically, as the common set becomes larger, we expect the computed similarity to reflect the true similarity between the two users.
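For illustration, the following Python sketch computes PCC over the common set of two users stored as rating dictionaries. The data structures, the use of each user's overall rating mean, and the handling of empty overlap or flat profiles are assumptions of the sketch, not details taken from the paper.

```python
from math import sqrt

def pearson(ratings_x, ratings_y):
    """Pearson correlation between two users over their common set.

    ratings_x, ratings_y: dicts mapping item id -> rating.
    Returns None when the users share no items.
    """
    common = set(ratings_x) & set(ratings_y)
    if not common:
        return None
    # Each user's mean is taken over all of his own ratings; taking the
    # mean over the common set only is another possible convention.
    mean_x = sum(ratings_x.values()) / len(ratings_x)
    mean_y = sum(ratings_y.values()) / len(ratings_y)
    num = sum((ratings_x[i] - mean_x) * (ratings_y[i] - mean_y) for i in common)
    den_x = sqrt(sum((ratings_x[i] - mean_x) ** 2 for i in common))
    den_y = sqrt(sum((ratings_y[i] - mean_y) ** 2 for i in common))
    if den_x == 0 or den_y == 0:
        return 0.0  # flat ratings on the common set; treated as neutral here
    return num / (den_x * den_y)

# Example: two users sharing items 1 and 2.
print(pearson({1: 5, 2: 1, 3: 4}, {1: 4, 2: 2, 5: 5}))
```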

The second similarity measure we examine is the cosine similarity measure [1, 3]. This measure treats each user as a vector in the items’ space and takes the cosine of the angle between the two vectors as the similarity between them:

$$\mathrm{COS}(u_x, u_y) = \frac{\sum_{i \in I_{x,y}} r_{x,i}\, r_{y,i}}{\sqrt{\sum_{i \in I_{x,y}} r_{x,i}^2}\,\sqrt{\sum_{i \in I_{x,y}} r_{y,i}^2}}. \quad (3)$$

Again, the common set is the core of this calculation. A third similarity measure is the mean difference weights (MDW) similarity measure proposed by Bobadilla et al. [7]. They used genetic algorithm (GA) learning to evolve the weights for the rating differences between users. However, these weights can be fixed to the mean of each difference weight interval proposed in [7]. For our experiments, we fix the weights as done by Al-Shamri and Al-Ashwal [8]:

$$\mathrm{MDW}(u_x, u_y) = \frac{1}{|I_{x,y}|}\sum_{i \in I_{x,y}} w\bigl(|r_{x,i} - r_{y,i}|\bigr), \quad (4)$$

where $w(d)$ is the fixed weight assigned to an absolute rating difference $d$.

We take into account the point raised by [8] about dividing formula (4) by the difference between the maximum and minimum values of the rating scale. They argued that this factor is not necessary because formula (4) already divides the weights by their number, and the numerator cannot exceed that number since each weight is at most one. The only effect of this factor is to reduce the similarity values, which in turn reduces the contribution of each neighbor’s rating in the aggregation process.
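The following sketch illustrates the difference weights idea in Python. The weight table is a purely illustrative placeholder (the actual values fixed in [7, 8] are not reproduced here); the function simply averages the weight assigned to each rating difference over the common set, as formula (4) describes.

```python
# Hypothetical weight table for a 1-5 rating scale: the weight assigned to
# each absolute rating difference d = |r_x - r_y|. These values are
# illustrative placeholders, not the weights fixed in [7] or [8].
DIFF_WEIGHTS = {0: 1.0, 1: 0.75, 2: 0.5, 3: 0.25, 4: 0.0}

def mdw(ratings_x, ratings_y, weights=DIFF_WEIGHTS):
    """Mean difference weights similarity: the average of the per-item
    difference weights over the common set."""
    common = set(ratings_x) & set(ratings_y)
    if not common:
        return None
    total = sum(weights[abs(ratings_x[i] - ratings_y[i])] for i in common)
    return total / len(common)

# Example: differences of 1 and 1 on the two common items.
print(mdw({1: 5, 2: 1, 3: 4}, {1: 4, 2: 2, 5: 5}))  # 0.75 with these weights
```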

4.1. Modified Mean Difference Weights Similarity Measure

The mean difference weights similarity measure does not take the user’s mean into consideration because it was proposed for learning algorithms like GA. However, a correction factor based on the rating means of the two users under consideration can be added when we fix the weights and rely on direct calculation without a learning algorithm. In this paper, we propose the following correction factor:

The modified mean difference weights similarity measure is simply formula (4) multiplied by the correction factor that takes the mean of each user into consideration:

Because of the correction factor, this measure does not give a high similarity value to users with one common item if their rating means are different. We call this similarity measure modified mean difference weights (MMD).
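As a rough illustration only, the sketch below combines the MDW average with one plausible mean-based correction factor that shrinks the similarity as the two rating means diverge. The correction factor shown is an assumed stand-in for formula (5), not the paper's exact expression, and the weight table is again a placeholder.

```python
# The difference weights are assumed placeholders (see the MDW sketch above).
DIFF_WEIGHTS = {0: 1.0, 1: 0.75, 2: 0.5, 3: 0.25, 4: 0.0}

def mean_correction(mean_x, mean_y, r_max=5.0, r_min=1.0):
    """Illustrative mean-based correction: equals 1 when the two rating
    means coincide and shrinks toward 0 as they diverge. This is an
    assumed stand-in for formula (5), not the paper's exact factor."""
    return 1.0 - abs(mean_x - mean_y) / (r_max - r_min)

def mmd(ratings_x, ratings_y, weights=DIFF_WEIGHTS):
    """Modified MDW: the MDW similarity of formula (4) multiplied by a
    mean-based correction factor."""
    common = set(ratings_x) & set(ratings_y)
    if not common:
        return None
    base = sum(weights[abs(ratings_x[i] - ratings_y[i])] for i in common) / len(common)
    mean_x = sum(ratings_x.values()) / len(ratings_x)
    mean_y = sum(ratings_y.values()) / len(ratings_y)
    return base * mean_correction(mean_x, mean_y)

# A user pair with one common item but very different rating means is
# penalized relative to plain MDW.
print(mmd({1: 5, 2: 5, 3: 4}, {1: 5, 4: 1, 5: 2}))
```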

4.2. Users’ Coverage

By increasing the cardinality of the common set, we expect that the recommender system will not be able to help all the active users. Therefore, we have to measure the system’s ability to help the intended users through a measure we call users’ coverage or penetration, which is different from items’ coverage (discussed later). This metric is defined as follows.

Definition 1 (users’ coverage). The users’ coverage of a given recommender system with a minimum predefined cardinality of the common set is the number of users benefitting from the recommender system (those who can get neighbors and hence predictions) over all the active users of the system:

$$\text{Users' Coverage} = \frac{N_{CS}}{N_A},$$

where $N_{CS}$ is the number of active users getting predictions from the system with a given cardinality of the common set (CS) and $N_A$ is the total number of the active users.

This measure helps us study the effect of increasing the cardinality of the common set on the usability of the system. A low value of users’ coverage means that the system could not help many users because the overlapping between them is low.
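In code, users' coverage reduces to the fraction of active users for whom the system produced at least one prediction. The sketch below assumes that the per-user prediction counts have already been collected into a dictionary; that data structure is an assumption of the sketch.

```python
def users_coverage(predictions_per_user):
    """Users' coverage (Definition 1): fraction of active users who
    received at least one prediction under the current common-set
    cardinality threshold.

    predictions_per_user: dict mapping active-user id -> number of
    predicted items that user received (0 if none).
    """
    n_active = len(predictions_per_user)
    if n_active == 0:
        return 0.0
    n_served = sum(1 for n in predictions_per_user.values() if n > 0)
    return n_served / n_active

# Example: 4 active users, one of whom got no predictions.
print(users_coverage({"u1": 12, "u2": 0, "u3": 3, "u4": 7}))  # 0.75
```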

4.3. Sample Dataset for Empirical Analysis

We construct a sample dataset in Table 1 with 19 users and 12 items. The first user is the active user and the remaining 18 users are training users. A zero value for an item indicates that the item has not been rated yet by the user and therefore can be suggested to him. The sample dataset has the following specifications:
(i) It should cover many cardinalities of the common set; therefore, we take the values 1, 2, 5, 8, and 10.
(ii) It should cover three rating mean groups (low, medium, and high).
(iii) The sample data is arranged such that one opposite-minded user and three users with different rating means (low, medium, and high) are available for each cardinality of the common set.
(iv) The last two users represent opposite-minded users to the active user, with 8 and 10 common items, respectively.
(v) For each rating mean group, the user with the larger cardinality of the common set inherits the same items as the user with the lower cardinality, to show the effect of increasing the cardinality of the common set without changing the previous set of items.

The similarity values between the active user and the training users of the sample dataset are listed in Table 1 for the above four similarity measures. The results show the following observations.
(1) The similarity value of the user with fewer common items is 1, while it is only 0.612 for the user who inherits the same ratings and adds more common items with the active user. The latter user has 8 common items with the active user, four with positive taste similarity and four with negative taste similarity; that is, the two users agree in their mode for both positive and negative items. Hence, we would expect a higher similarity value than in the first case, yet it is only 0.612. This means that low cardinality misleads the RS by giving high similarity values to users having very little history with the active user.
(2) Three similarity measures (PCC, COS, and MDW) give a full positive similarity value to two training users even though they have only one common item with the active user. Moreover, the users who inherit the same ratings of these two users get lower similarity values under all similarity measures. This means that increasing the common set cardinality reduces the similarity values.
(3) PCC with only one common item always gives the maximum similarity value irrespective of the rating means of the two users, because the numerator of the formula equals the denominator in this case.
(4) MDW gives the same similarity value, 0.75, to three users that have two common items with the active user but belong to three different rating mean groups. This changes with MMD, which takes the users’ means into consideration.
(5) One training user has zero similarity (the neutral value on the PCC scale) even though he is closer to the active user than another user. This case is corrected when the cardinality of the common set is increased to 5 for the same rating mean group.
(6) PCC can identify opposite-minded users easily, giving a similarity value of −1 to the last two users; that is, the two users are dissimilar to the active user. PCC also gives negative values to all the other opposite-minded users.
(7) The results show that COS cannot identify opposite-minded users, giving high similarity values of 0.636 and 0.565 to the last two users, respectively. High values are also given to all the other opposite-minded users. These values are very misleading to the RS.

To get an overall view, Table 2 summarizes the results for each rating mean group. We can draw the following observations from this table.
(i) Usually, similarity values decrease as we increase the cardinality of the common set. This is because of the increased number of items in the user profile.
(ii) PCC is very sensitive to the cardinality of the common set because it works on the deviation of each rating from the rating mean rather than on the rating itself. This deviation spans negative and positive values and becomes more variable as more common items are added. Hence, PCC gives different similarity values for different common sets, and these values decrease as the cardinality of the common set increases. Accordingly, many users may get unfair similarity values, which increases or reduces their contribution to the active user.
(iii) PCC performance with opposite-minded users is reasonable, as it always gives negative values with low deviation. This means PCC has good capabilities for capturing opposite-minded users and hence can easily prevent them from becoming neighbors of a given active user.
(iv) COS is less sensitive to the common set cardinality. It always gives high values, even for opposite-minded users such as group four in Table 2, and these values decrease only slightly as the cardinality of the common set increases. This means the measure is not good at capturing either like-minded or opposite-minded users.
(v) MDW and MMD give similarity values with less deviation than those of PCC. The weakness of these two measures lies in their ability to capture opposite-minded users: they give zero similarity values to the opposite-minded users, and this value represents a neutral case. This weakness can be alleviated by counting only those users having a similarity value greater than zero.

5. Experiments

This section discusses the methodology for choosing the dataset used in our experiments, the way the dataset is divided into training and test subsets, and the metrics used for evaluating the system performance. The selected dataset should reflect different mean groups and different cardinalities of the common set in order to study both effects. The following subsections analyze the MovieLens dataset and select the experimental dataset for this paper.

5.1. One Million MovieLens Dataset

The one million MovieLens dataset consists of 1000209 ratings assigned by 6040 users to 3900 movies [15]. This paper studies three parameters of the collaborative recommender system, namely, the common set cardinality, the user rating mean group, and the number of neighbors for a given active user, all of which have a direct effect on the performance of the similarity measure. Therefore, we further subdivide each dataset into three data subsets according to the rating mean group of each user. We consider three rating mean groups: the low mean group (between 1 and 2.4), the medium mean group (between 2.4 and 3.4), and the high mean group (between 3.4 and 5). Accordingly, nine data subsets are obtained, as listed in Table 3.
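For illustration, the grouping step can be sketched as follows; the boundary handling (strictly below 2.4 and 3.4) is an assumption of the sketch, since the paper only states the ranges.

```python
def rating_mean_group(ratings):
    """Assign a user to a rating mean group: 'low' (1-2.4),
    'medium' (2.4-3.4), or 'high' (3.4-5). Boundary handling is assumed."""
    mean = sum(ratings.values()) / len(ratings)
    if mean < 2.4:
        return "low"
    if mean < 3.4:
        return "medium"
    return "high"

users = {
    "u1": {1: 1, 2: 2, 3: 2},   # mean ~1.67 -> low
    "u2": {1: 3, 2: 3, 3: 4},   # mean ~3.33 -> medium
    "u3": {1: 5, 2: 4, 3: 5},   # mean ~4.67 -> high
}
print({uid: rating_mean_group(r) for uid, r in users.items()})
```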

The results of Table 3 show that less than 1% of the total users have a low rating mean. Users usually rate items highly: 78.15% of the total users have a high rating mean. Hence, this dataset represents a skewed real dataset.

5.2. Experimental Dataset

We conduct our experiments on a dataset selected from the 1 M MovieLens dataset [15]. To test the similarity measures under all possible rating mean groups and all possible rating behaviors, we select from Table 3 the group that has the lowest number of users as the benchmark. Therefore, our dataset takes the subsets of the low rating mean group as the reference for the other rating mean groups. All users of this group belong to the low rating mean group but reflect three different rating behaviors: a low number of ratings (13), a medium number of ratings (15), and a high number of ratings. Accordingly, we randomly select 36 users from each of the other two rating mean groups for our dataset. Hence, we have 36 users for each of the 3 rating mean groups, or equivalently 108 users.

For the experiments, we use leave-one-out cross-validation, which each time takes one user of the dataset as the test user and the remaining users as the training users. Thus, each user is used the same number of times for training and exactly once for testing, and the numbers of total users, training users, and active users are 108, 107, and 108, respectively.

During the testing phase, the set of ratings declared by the active user, $R_a$, is divided randomly into two disjoint sets, namely, training ratings $R_a^{tr}$ (20%) and test ratings $R_a^{te}$ (80%), such that $R_a = R_a^{tr} \cup R_a^{te}$ and $R_a^{tr} \cap R_a^{te} = \emptyset$. The RS treats $R_a^{tr}$ as the only declared ratings, while $R_a^{te}$ is treated as unseen ratings that the system attempts to predict for testing the RS performance [3].
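A minimal sketch of this evaluation protocol, with user profiles stored as dictionaries, is given below. The 20%/80% proportions follow the text, while the random seed and the rounding of the split size are assumptions of the sketch.

```python
import random

def leave_one_out(users):
    """Yield (active_user_id, training_users) pairs: each user serves
    once as the active (test) user and otherwise as a training user."""
    for active_id in users:
        training = {uid: r for uid, r in users.items() if uid != active_id}
        yield active_id, training

def split_declared_ratings(ratings, train_fraction=0.2, seed=0):
    """Split the active user's declared ratings into training ratings (20%)
    used as the visible profile and test ratings (80%) to be predicted."""
    items = list(ratings)
    random.Random(seed).shuffle(items)
    n_train = max(1, round(train_fraction * len(items)))  # rounding is assumed
    train = {i: ratings[i] for i in items[:n_train]}
    test = {i: ratings[i] for i in items[n_train:]}
    return train, test

# Usage sketch over a toy set of user profiles.
users = {"u1": {1: 4, 2: 5}, "u2": {1: 3, 3: 2}, "u3": {2: 1, 3: 4}}
for active_id, training in leave_one_out(users):
    train, test = split_declared_ratings(users[active_id])
```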

5.3. Conducted Experiments

We conduct four experiments on the 108-user dataset. The first experiment uses the Pearson correlation coefficient (formula (1)) for the similarity computation and is called Correlation-Based RS (CBRS). The second experiment uses the cosine vector similarity measure (formula (3)) and is called Cosine Vector RS (CVRS). The third experiment uses the mean difference weights similarity measure (formula (4)) and is called Difference Weights RS (DWRS). The fourth experiment uses the modified mean difference weights similarity measure (formula (6)) and is called MDRS.

Each experiment is performed five times with different cardinalities of the common set. The first run assumes the cardinality of the common set to be greater than or equal to one, and the remaining runs require at least 5, 10, 15, and 20 common items. Table 4 lists the different cardinalities of the common set and their codes (CS1 to CS5).
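In implementation terms, each cardinality threshold is simply a filter applied before neighbors are ranked. The sketch below drops candidate neighbors whose overlap with the active user falls below the chosen threshold and keeps only positive similarities, in line with the later discussion of untrusted users; the similarity function is passed in, and the data structures are assumptions of the sketch.

```python
def eligible_neighbors(active_ratings, training_users, similarity, min_common):
    """Return (user_id, similarity) pairs for training users whose common
    set with the active user has at least `min_common` items, e.g.
    min_common = 1, 5, 10, 15, 20 for CS1 to CS5."""
    candidates = []
    for uid, ratings in training_users.items():
        common = set(active_ratings) & set(ratings)
        if len(common) < min_common:
            continue  # overlap too small for this cardinality threshold
        sim = similarity(active_ratings, ratings)
        if sim is not None and sim > 0:  # keep only trusted (positive) neighbors
            candidates.append((uid, sim))
    # Rank by similarity; the caller keeps the top-K as the neighborhood set.
    return sorted(candidates, key=lambda p: p[1], reverse=True)

# Usage sketch: pass any similarity function taking two rating dicts,
# e.g. eligible_neighbors(active_profile, training_users, pearson, min_common=5)
```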

5.4. Evaluation Metrics

The performance of each examined CRS is evaluated using items’ coverage, percentage of the correct predictions (PCP), and mean absolute error (MAE) [3, 16, 17]. Items’ coverage measures the percentage of items for which a CRS can provide predictions, that is, the number of items for which the CRS can generate predictions for a given active user over the total number of unseen items for the same user. The total items’ coverage over all active users is

Here, the number of predicted items is counted for each active user and the result is aggregated over the total number of active users. A low items’ coverage value indicates that the CRS will not be able to assist the user with many of the items he has not rated.

The PCP is the percentage of items correctly predicted by the system relative to the total number of items in the test ratings set of the active user. The set of correctly predicted items for a given active user and the total PCP over all the active users are defined by the following formulae [18]:

The MAE measures the deviation of the predictions generated by the CRS from the true ratings specified by the active user [3, 16, 17]. The MAE over all the active users is given by formula (10); a lower MAE corresponds to more accurate predictions of a given CRS.
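For illustration, the sketch below computes per-user items' coverage, PCP, and MAE and then averages them over the active users. The correctness criterion used for PCP (a rounded prediction matching the true rating), the use of the test set as the pool of unseen items, and the macro-averaging are assumptions of the sketch rather than the paper's exact formulae.

```python
def evaluate_user(predictions, test_ratings):
    """Per-user metrics for one active user.

    predictions: dict item -> predicted rating (only items the system
                 could predict).
    test_ratings: dict item -> true (hidden) rating.
    """
    n_test = len(test_ratings)
    predicted_items = [i for i in test_ratings if i in predictions]
    coverage = len(predicted_items) / n_test if n_test else 0.0
    # "Correct" here means the rounded prediction equals the true rating;
    # this criterion is an assumption, not necessarily the paper's.
    correct = sum(1 for i in predicted_items
                  if round(predictions[i]) == test_ratings[i])
    pcp = correct / n_test if n_test else 0.0
    abs_errors = [abs(predictions[i] - test_ratings[i]) for i in predicted_items]
    mae = sum(abs_errors) / len(abs_errors) if abs_errors else None
    return coverage, pcp, mae

def _mean(values):
    return sum(values) / len(values) if values else 0.0

def average_over_users(per_user):
    """Macro-average the per-user metrics over all active users
    (macro-averaging is an assumption of this sketch)."""
    coverage = _mean([m[0] for m in per_user])
    pcp = _mean([m[1] for m in per_user])
    mae = _mean([m[2] for m in per_user if m[2] is not None])
    return coverage, pcp, mae
```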

The predicted rating, $p_{a,i}$, is usually computed as an aggregate of the ratings of $u_a$’s neighborhood set for the same item $i$. The common prediction formulae are [1, 3]

$$p_{a,i} = \frac{\sum_{u_b \in \hat{U}_i} s_{a,b}\, r_{b,i}}{\sum_{u_b \in \hat{U}_i} |s_{a,b}|}, \quad (11a)$$

$$p_{a,i} = \bar{r}_a + \frac{\sum_{u_b \in \hat{U}_i} s_{a,b}\,(r_{b,i} - \bar{r}_b)}{\sum_{u_b \in \hat{U}_i} |s_{a,b}|}, \quad (11b)$$

where $\hat{U}_i$ denotes the set of neighbors of $u_a$ who have rated item $i$, $s_{a,b}$ is the similarity between $u_a$ and $u_b$, and $\bar{r}_b$ is the average rating of user $u_b$. We use the priority-based prediction scheme of [3, 8], where formula (11b) is applied first; if its predicted rating falls outside the rating range, formula (11a) is used instead. For all experiments, the neighborhood set size $K$ is varied from 10 to 50 with a step size of 20 each time, that is, $K \in \{10, 30, 50\}$.
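A sketch of the priority-based prediction described above follows: the mean-centered aggregation (formula (11b)) is tried first, and the plain weighted average (formula (11a)) is used when the result falls outside the rating range. The data structures and the handling of an empty or zero-weight neighborhood are assumptions of the sketch.

```python
def predict(active_mean, neighbors, item, r_min=1.0, r_max=5.0):
    """Priority-based prediction for one item.

    neighbors: list of (similarity, neighbor_mean, neighbor_ratings) tuples
               for the active user's neighborhood set.
    Returns None if no neighbor has rated the item or all weights vanish.
    """
    rated = [(s, m, r[item]) for s, m, r in neighbors if item in r]
    if not rated:
        return None
    denom = sum(abs(s) for s, _, _ in rated)
    if denom == 0:
        return None
    # Formula (11b): mean-centered aggregation, tried first.
    centered = active_mean + sum(s * (r - m) for s, m, r in rated) / denom
    if r_min <= centered <= r_max:
        return centered
    # Formula (11a): plain weighted average, used when (11b) is out of range.
    return sum(s * r for s, _, r in rated) / denom
```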

6. Analysis of the Results

The results show that many active users could not benefit from the systems when the cardinality of the common set is increased. The users’ coverage values for all systems are listed in Table 5. The users’ coverage of CBRS is somewhat lower than that of CVRS because CBRS similarity values can be positive or negative; users with negative similarity values are always discarded as untrusted, dissimilar users, so fewer users remain with positive similarity values.

The results show that the users’ coverage is high for low cardinality values and starts decreasing as the cardinality of the common set increases, for all systems. The lowest value is for CBRS, which indicates that only 50% of the active users can get recommendations when the cardinality of the common set is 5. The users’ coverage values of MDRS are similar to those of DWRS, as the two measures differ only in the correction factor.

Actually, the similarity measure is a crucial part of any CRS. However, our experiments show that its impact depends largely on the cardinality of the common set. The results of all systems for all metrics, three neighborhood set sizes, and five cardinalities of the common set are listed in Table 6. These results show that we cannot compare different similarity measures without taking the cardinality of the common set into account.

Horizontally, the results improve with increasing cardinality of the common set, and vertically they improve with increasing neighborhood set size. This improvement becomes narrower as we move in both directions. We may conclude that implementing an RS with a large neighborhood set size and a low cardinality of the common set hides the effectiveness of the similarity measure. In general, increasing the cardinality of the common set has two contradictory effects on the CRS performance. Negatively, it reduces the users’ coverage, and hence the system cannot help as many users as before. Positively, it enhances the CRS ability to identify true neighbors and hence increases the system accuracy.

For more clarifications, we computed the improvement percentages between the results of CS1 and CS5 in terms of PCP, coverage, and MAE for all systems. For comparison purposes, we used the following two formulae for measuring increase improvement percentages (for PCP and coverage) and decrease improvement percentages (for MAE) [18]:
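As a sketch of these two improvement percentages, taking CS1 as the baseline setting and CS5 as the improved one (the exact expressions of [18] may differ, so the following forms are assumptions):

$$\text{Increase Improvement \%} = \frac{\text{Result}_{\mathrm{CS5}} - \text{Result}_{\mathrm{CS1}}}{\text{Result}_{\mathrm{CS1}}} \times 100,$$

$$\text{Decrease Improvement \%} = \frac{\text{Result}_{\mathrm{CS1}} - \text{Result}_{\mathrm{CS5}}}{\text{Result}_{\mathrm{CS1}}} \times 100.$$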

Table 7 summarizes the improvement percentages of PCP, coverage, and MAE for CBRS, CVRS, DWRS, and MDRS. The results show large improvement percentages between systems using low cardinalities of the common set and those using large cardinalities. The improvement percentages are especially high for CVRS, where they reach 374.31%, 280.31%, and 92.98% for PCP, coverage, and MAE, respectively. This means that close neighbors, and consequently good predictors, are found with CS5. Actually, we expect good results for a given active user if he can find a close set of neighbors, even if the number of neighbors is low. A low cardinality of the common set does not allow the system to elect representative neighbors, and thus the performance is low. The improvement percentages decrease as we increase the neighborhood set size. This is an expected result because the set of neighbors gets wider, and therefore the chance of getting close neighbors is higher than before.

Another important point is that MDRS gets the lowest improvement percentages (only 42.63%) even though its overall performance is better than that of the other systems. This indicates that this similarity measure is able to elect representative neighbors from the very beginning, and hence its improvement is slow.

For more clarity, Table 8 lists the best system in terms of both the neighborhood set size and the cardinality of the common set, while Figures 1–3 depict the results of all cardinalities of the common set for the two extreme values of the neighborhood set size, that is, $K = 10$ and $K = 50$. The results show the following points.
(i) MDRS is the best in terms of all metrics for the smaller neighborhood set size and all cardinality values of the common set, while CVRS is the worst one. This means MDRS generates true neighbors from the very beginning.
(ii) CBRS results approach those of MDRS faster than CVRS as the cardinality of the common set increases.
(iii) In terms of PCP, CVRS is the worst for all $K$ and all cardinalities of the common set.
(iv) As we increase $K$ or the cardinality of the common set, the performances of all systems become very close. This indicates that the effect of the similarity measure becomes less important compared to other factors.
(v) Increasing $K$ for CS3 to CS5 gives some advantage to CBRS or CVRS. However, this advantage is very small compared to the results of the other systems. In this situation, other factors play a more effective role for the recommender system.
(vi) Increasing $K$ hides the ability of the similarity measure to select good neighbors. In this case, the contributions of so many neighbors compensate one another and consequently hide the differences between the similarity measures.
(vii) Vertically, with CS1 and CS2, MDRS and DWRS are the best in terms of PCP and MAE. At this stage, many neighbors who have been ranked at the top for CBRS and CVRS may have only one or two items in common with the active user. These neighbors are not true neighbors, as their overlap with the active user is very small.

7. Conclusions

One important way to enhance CRS accuracy is to select a similarity measure that produces a close set of neighbors. The results show that the effect of the similarity measure depends on the cardinality of the common set, as the system accuracy improves with high values of this cardinality. Moreover, this effect becomes less significant as we increase the number of neighbors for the active user; in this case, the contributions of the large number of neighbors compensate one another, and hence the differences between systems become very small.

The modified mean difference weights similarity measure outperforms the other measures in many cases, as it takes the rating means into consideration. In general, increasing the cardinality of the common set has two contradictory effects on the CRS performance. Negatively, it reduces the users’ coverage, and hence the system cannot help as many users as before. Positively, it enhances the CRS ability to identify true neighbors and hence increases the system accuracy. The results show that some similarity measures outperform others for a specific cardinality of the common set, but the roles may change for another cardinality. For example, MDRS performs better than both CBRS and CVRS for CS1 with a small neighborhood set, yet it loses this advantage in terms of MAE with CS4 and CS5 and a larger neighborhood set. This indicates that the performance of the similarity measures differs for different cardinalities of the common set.

Many approaches have been proposed to alleviate the effect of small sets of common items; some of them try to predict the missing ratings, while others devalue the contribution of the corresponding users. Our view for future work is to propose new techniques that rely directly on the actual user data, without any predictions to fill the missing values.

Competing Interests

The author declares that there are no competing interests.