Utility
It represents the value that users receive from being recommended items: if a user enjoys the recommended items, he/she received useful recommendations. One way to measure utility is to look at the ratings the user gives to recommended items after consuming them, but collecting these in an online evaluation can be costly. In offline evaluation we can use predictive accuracy metrics like:
- Mean absolute error (MAE), which is the average absolute difference between the ratings predicted by the recommender and the ratings given by the users;
- Root mean squared error (RMSE), which penalizes large errors more heavily than MAE, along with related variants: mean squared error (MSE), average RMSE and average MAE;
- Precision (the fraction of items in the recommendation list that the user consumed or rated) and recall (the number of consumed items in the recommendation list out of the total number of items the user consumed). In Precision@N and Recall@N, N stands for the size of the recommendation list;
- F1-score: it is the harmonic mean of the precision and recall. The more generic $F_{\beta}$ score applies additional weights, valuing one of precision or recall more than the other. The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if both precision and recall are zero. The F1 score is a popular performance measure for classification and often preferred over, for example, accuracy when data is unbalanced. However, the F-measures do not take true negatives into account, hence measures such as the Matthews correlation coefficient, Informedness or Cohen's kappa may be preferred to assess the performance of a binary classifier.
- ROC curves plot the rate of liked items that appear in the recommendation list (true positive rate) against the rate of disliked items that appear in it (false positive rate). Unlike error, precision and recall, this method emphasizes items that were suggested but that the user disliked;
- Ranking metrics: they consider the fact that users don't browse through all the recommended items. The R-score metric, for example, discounts the value of a recommendation according to its rank position (top-ranked items are valued more). Its formula is: $util(R_u) = rank(R_u) = \sum_{j=1}^{|R_u|} \dfrac{\max(r(i_j) - d, 0)}{2^{\frac{j-1}{a-1}}}$, where $r(i_j)$ is the user's rating of the item at rank $j$, $d$ is a neutral rating and $a$ is the half-life parameter, i.e. the rank at which an item's value is halved. Other ranking scores are the Kendall and Spearman rank correlations and the Normalized Distance-based Performance Measure (NDPM);
- Online evaluation: click-through rate (CTR) is calculated as the ratio of clicked/interacted recommended items to the number of items recommended. Retention measures the impact of the recommender system in keeping users consuming items or using the system; it is often used with A/B testing.
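
To make the offline metrics listed above concrete, here is a minimal Python sketch. It assumes ratings are held in plain lists, recommendations in a ranked list and consumed items in a set. The function names and the toy data are illustrative only and do not come from any particular library.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and observed ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; large errors weigh more than in MAE."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def precision_recall_at_n(recommended, consumed, n):
    """Precision@N and Recall@N for one user.

    recommended: ranked list of item ids; consumed: set of item ids.
    """
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in consumed)
    precision = hits / len(top_n) if top_n else 0.0
    recall = hits / len(consumed) if consumed else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean (F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def r_score(ranked_ratings, d, a):
    """R-score of one ranked list.

    ranked_ratings: the user's ratings of the items, in rank order.
    d: neutral rating; a: half-life rank (assumed to be > 1 here).
    """
    return sum(
        max(r - d, 0) / 2 ** ((j - 1) / (a - 1))
        for j, r in enumerate(ranked_ratings, start=1)
    )


# Toy data (made up for illustration).
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5.0, 3.0, 4.0, 4.0]
print("MAE:", mae(predicted, actual))
print("RMSE:", rmse(predicted, actual))

p, r = precision_recall_at_n(["a", "b", "c", "d", "e"], {"a", "c", "f", "g"}, n=5)
print("Precision@5:", p, "Recall@5:", r, "F1:", f_beta(p, r))

print("R-score:", r_score([5, 3, 4], d=3, a=2))
```

In practice these per-user values would be averaged over all users in the test set. The choice of `d` and `a` for the R-score depends on the rating scale and on how far down the list users are expected to browse.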