Recommendation Systems for GoodReads

Which recommendation systems perform best for recommending books to readers on GoodReads?

Introduction: We received two datasets from GoodReads. This report shares insights from exploratory analysis, content-based and collaborative filtering recommendation systems, and association analysis, and closes with recommendations for GoodReads.

Data Exploration: The books dataset includes profile information on 10,000 books, such as ISBN, authors, language, and publication year. It also contains each book's total count of star ratings and text reviews. The ratings dataset contains the book_id, user_id, and rating. On average, users rate 18 books. Only approximately 4,000 of the 53,424 users (less than 8%) have rated a book more than once. Two users tied for the highest number of books rated, at 200 each. There are 1,199 users (only 2%) who have rated at least 100 books, accounting for 165,905 ratings. See Appendix I for more details.

Foreign Languages: While most books in the GoodReads database are in English (8,730 or 87.3%), there are 19 other languages present, including one category for multi-language, across 186 books (1.86%). A further 1,084 books (10.84%) have no language indicated. The language with the highest number of star ratings is Spanish (2,494,249), followed by Arabic (1,151,016) and French (921,125). The language with the highest average rating is Turkish (4.49), followed by multi-language (4.32) and Polish (4.3). The languages with the most text reviews are Arabic (136,164), Spanish (64,075), and French (30,757). This suggests that Spanish readers are the most likely to leave a rating, while Arabic readers are the most likely to spend time writing a review. Among foreign-language books, the three with the most five-star ratings are translations of American hit novels: Fahrenheit 451 (Spanish), Jurassic Park #1 (Spanish), and Christine (French). See Appendix II and Appendix III for additional insights.

Cover Image: One might assume that a book without a cover image is less appealing to read, but the ratings do not support this. Although more books have a cover image than not (26,704 vs 13,318), both groups have the same average rating of 4.0. This suggests that book reviews and ratings are more important than a descriptive book cover.

Authors: The most frequently appearing author on GoodReads is Stephen King, with 60 books, followed by Nora Roberts. For a single book, Stephen King has achieved close to 800,000 ratings, making him the most influential author. However, the highest rated author on GoodReads is Bill Watterson, the author of the classic American cartoon Calvin and Hobbes. Watterson’s lowest rated book is at 4.65 and his highest is at 4.80 (The Complete Calvin and Hobbes Collection). See Appendix IV for a full list.

Publication Year: There is a cluster of 25 books published around 500 BCE that are popular with today’s readers despite having been published over 2,000 years ago. These classics are rated higher than many books published in recent years. They represent less than 1% of books. 

Most Popular Books: The most popular books tend to appeal to younger audiences or are commonly read for school. Unsurprisingly, the Harry Potter series leads the top 10 chart. Harry Potter #1 has the most 5-star ratings, followed by The Hunger Games #1. Harry Potter has an average rating of 4.44, while The Hunger Games has an average rating of 4.34. (Although these are the most read, neither is rated as high as Bill Watterson’s lowest rated book, Calvin and Hobbes Treasury, at 4.65.) They are also the two books with the highest rating counts, Hunger Games #1 (4,942,365) and Harry Potter #1 (4,800,065), making them the most read books on GoodReads. While most of the top 10 are fantasy, notable classic fiction titles such as To Kill A Mockingbird and Pride and Prejudice also appear in the GoodReads top 10.

Data Preprocessing: As part of the data exploration, we checked for duplicate records for the same book and found 69 duplicate title observations out of 10,000 titles, which is 0.69% of the dataset. Duplicates are removed so that we can build a functioning utility matrix. To reduce calculation times, we filtered for users who have submitted at least 100 ratings. There are 1,199 users (only 2%) who have rated at least 100 books, yielding 165,905 ratings, which is 17% of the dataset. Because users may also have submitted ratings for books that are not in the books dataset, we keep only ratings for books that have a reference in the books dataset and remove duplicated ratings by the same user, resulting in 164,525 ratings. While there were no missing values for ratings and authors, there were missing values for publication year (129), language code (1,084), isbn (4939), and original title (4079), which we turn into 0s for the MBA association analysis.
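The filtering steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the project: the function name preprocess, the tuple layout (user_id, book_id, rating), and the toy data are all assumptions made for the example.

```python
from collections import Counter


def preprocess(ratings, known_book_ids, min_ratings=100):
    """Filter (user_id, book_id, rating) tuples as described in the text.

    Hypothetical helper: keeps only active users, drops ratings for books
    missing from the books dataset, and drops repeated (user, book) pairs.
    """
    ratings = list(ratings)
    # Count how many ratings each user submitted in the raw data.
    counts = Counter(user_id for user_id, _, _ in ratings)

    seen = set()
    kept = []
    for user_id, book_id, rating in ratings:
        if counts[user_id] < min_ratings:
            continue  # user is not active enough
        if book_id not in known_book_ids:
            continue  # book has no reference in the books dataset
        if (user_id, book_id) in seen:
            continue  # duplicated rating by the same user; keep the first
        seen.add((user_id, book_id))
        kept.append((user_id, book_id, rating))
    return kept


# Toy data with a threshold of 3 instead of 100 so the effect is visible.
toy = [(1, 10, 5), (1, 11, 4), (1, 11, 3), (1, 99, 2), (2, 10, 4)]
result = preprocess(toy, known_book_ids={10, 11}, min_ratings=3)
print(result)
```

In the toy run, user 2 is dropped for having too few ratings, book 99 is dropped because it is not in the catalogue, and the second rating of book 11 by user 1 is dropped as a duplicate.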

Modeling Methodology: We train a random-recommendation baseline and compare it with user-based (UBCF) and item-based (IBCF) collaborative filtering models, varying the number of nearest neighbors (nn) for UBCF, the k value for IBCF, and the similarity measure (cosine or Pearson). Cosine similarity measures the angle between two user rating vectors: a cosine similarity of 1 (equivalently, a cosine distance of 0) means the two vectors point in the same direction, so the two users are maximally similar. Pearson correlation is cosine similarity computed on mean-centered ratings, so it also adjusts for users who rate consistently high or low. We used 90% of the data for training, restricted to ratings of 4 or above. For each user in the test set, 15 ratings were given to the model and the rest were predicted. For evaluation, we considered RMSE, MSE, and MAE, and lastly reviewed an ROC plot.
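To make the two similarity measures and the error metrics concrete, here is a small Python sketch. It is illustrative only: the function names and the two toy users are assumptions, and the project's actual models may handle missing ratings and normalization differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for an all-zero vector
    return dot / (norm_u * norm_v)


def pearson_similarity(u, v):
    """Pearson correlation: cosine similarity of mean-centered ratings."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u],
                             [b - mean_v for b in v])


def mse(predicted, actual):
    """Mean squared error between predicted and actual ratings."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error."""
    return math.sqrt(mse(predicted, actual))


def mae(predicted, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Two hypothetical users with the same taste pattern; bob rates more harshly.
alice = [5, 4, 5, 4]
bob = [2, 1, 2, 1]
print(cosine_similarity(alice, bob))   # high, but below 1
print(pearson_similarity(alice, bob))  # centering removes bob's harshness
```

The toy users show why the choice of measure matters: after mean-centering, the two rating patterns are identical, so Pearson treats the users as perfectly correlated even though their raw ratings differ.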

To create the binary matrix for market basket analysis (MBA), we replace NAs with 0 and any rating greater than 0 with 1. To generate rules, we apply the Apriori algorithm, which relies on the property that every subset of a frequent itemset must also be frequent. We develop rules at the item level by setting a minimum support of 0.046 and a minimum confidence. Support is the proportion of users who rated both associated books. Confidence is the proportion of users who rated book A who also rated book B. Lift is the ratio of that confidence to the overall proportion of users who rated B; a lift above 1 means the two books are rated together more often than would be expected if they were independent.
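The three rule metrics can be computed by hand for a single candidate rule, which may help when reading the results. The sketch below is not the Apriori algorithm itself (it does no candidate generation or pruning), and the function names and toy ratings are assumptions made for illustration.

```python
def binarize(ratings_by_user):
    """Turn {user: {book: rating or None}} into {user: set of rated books}.

    Mirrors the binary matrix: a missing rating (None) counts as 0 and any
    rating greater than 0 counts as 1.
    """
    return {
        user: {book for book, rating in books.items()
               if rating is not None and rating > 0}
        for user, books in ratings_by_user.items()
    }


def rule_metrics(baskets, a, b):
    """Support, confidence, and lift for the rule 'rated a => rated b'."""
    n = len(baskets)
    count_a = sum(1 for items in baskets.values() if a in items)
    count_b = sum(1 for items in baskets.values() if b in items)
    count_ab = sum(1 for items in baskets.values() if a in items and b in items)

    support = count_ab / n
    confidence = count_ab / count_a if count_a else 0.0
    lift = confidence / (count_b / n) if count_b else 0.0
    return support, confidence, lift


# Hypothetical ratings for four users and three books.
toy_ratings = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 3, "B": 5, "C": None},
    "u3": {"A": 4, "C": 2},
    "u4": {"C": 5},
}
baskets = binarize(toy_ratings)
support, confidence, lift = rule_metrics(baskets, "A", "B")
print(support, confidence, lift)
```

In this toy example, two of four users rated both A and B (support 0.5), two of the three users who rated A also rated B (confidence 2/3), and since half of all users rated B, the lift is (2/3) / 0.5, which is above 1.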

Analysis of Recommendation Systems: In terms of collaborative filtering systems, the IBCF model with k=25 performs best followed by the UBCF model with nn=25.


It is not surprising that the recommendations from these two models do not overlap. The UBCF model suggests books that other users similar to reader #1 have read (teen fiction), while the IBCF model suggests books similar to those in reader #1's own rating history (books that appeal to an older audience).

IBCF_25: Holy Bible: King James Version, The Hound of the Baskervilles, The Polar Express, Seabiscuit: An American Legend, The Story of My Life

UBCF_25: Madeline, Beezus and Ramona (Ramona, #5), Cat on a Hot Tin Roof, The House of Hades (The Heroes of Olympus, #4), and Angelfall (Penryn & the End of Days, #1)

In performing MBA, we find that the majority of the 54 associations at a support of 0.046 are related to the Sookie Stackhouse Vampire Mystery series. Lowering the support to 0.03, we see more titles related to the Stephanie Plum and Dresden Files series. The lift results support the assumption that titles in a series tend to be read together.

When we lowered the confidence level to 0.01, we found popular titles and interesting results, including those for the Harry Potter series, which appears when confidence is at 0.03.

Association rules (support is at 0.046, conf=0.08):

Title | Associated title | Lift
The Great Gatsby | Little Women #1 | 7.19
Lord of the Rings #1 | Animal Farm | 7.19
Memoirs of a Geisha | The Diary of a Young Girl | 7.19

However, with such low confidence, we are hesitant to recommend MBA. Between the different models, we recommend that GoodReads use a collaborative filtering model for individual recommendations, because MBA overwhelmingly returns results indicating that books are read in series, which is an obvious assumption. MBA may be good for a general understanding of the entire dataset, for example that many GoodReads readers enjoy vampire mysteries. It also suggests that users who rate more than 100 books are in fact avid series readers and fans of science fiction.

Recommendation Systems Discussion: Content-based systems recommend books to user C that are similar to items user C has previously rated highly. As a business application, such a system can recommend books matching a profile that the reader rated highly, which in turn builds a user profile. As an example, Harry Potter tends to be read as a series, and young readers may also enjoy books such as The Hunger Games. However, some people may not provide demographic information. Challenges include the difficulty of integrating users with no ratings, a tendency to recommend a less diverse selection of books, and the need for much more item feature information than collaborative filtering requires.

Collaborative filtering systems build recommendations from a user’s rating history as well as similar decisions made by others, hence the term ‘collaborative’. They answer the question: which books do users with interests similar to yours enjoy? As a business application, collaborative filtering offers personalized recommendations. It is computationally fast for large sparse data and accounts for both rating values and the count of rated items. However, unlike content-based systems, it tends to recommend popular books (those rated by similar users), either one at a time or as several books that are unrelated to each other.
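The user-based idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the model evaluated in this report: the function names, the choice of cosine similarity over co-rated books, the similarity-weighted average, and the toy readers are all hypothetical.

```python
import math


def cosine_on_shared(r1, r2):
    """Cosine similarity over the books both users have rated."""
    shared = set(r1) & set(r2)
    if not shared:
        return 0.0
    dot = sum(r1[b] * r2[b] for b in shared)
    n1 = math.sqrt(sum(r1[b] ** 2 for b in shared))
    n2 = math.sqrt(sum(r2[b] ** 2 for b in shared))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def recommend_ubcf(ratings, target, nn=2, top_n=3):
    """Recommend unseen books using the nn most similar users."""
    sims = [(cosine_on_shared(ratings[target], other), user)
            for user, other in ratings.items() if user != target]
    neighbours = sorted(sims, reverse=True)[:nn]

    totals, weights = {}, {}
    for sim, user in neighbours:
        if sim <= 0:
            continue  # ignore neighbours with nothing in common
        for book, rating in ratings[user].items():
            if book in ratings[target]:
                continue  # only score books the target has not rated
            totals[book] = totals.get(book, 0.0) + sim * rating
            weights[book] = weights.get(book, 0.0) + sim

    # Predicted score is the similarity-weighted average neighbour rating.
    scores = {book: totals[book] / weights[book] for book in totals}
    return sorted(scores, key=lambda b: (-scores[b], b))[:top_n]


# Hypothetical readers and ratings.
toy = {
    "ann": {"HP1": 5, "HG1": 4},
    "ben": {"HP1": 5, "HG1": 4, "Divergent": 5},
    "cat": {"HP1": 1, "HG1": 5, "Dracula": 4},
    "dan": {"Gatsby": 3},
}
recs = recommend_ubcf(toy, "ann", nn=2)
print(recs)
```

Here ann's two nearest neighbours are ben and cat, so she is recommended the books they rated that she has not, ranked by predicted score. An item-based variant would instead compute similarities between books and score candidates by their similarity to the books ann already rated highly.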

Association rules help uncover sets of books that tend to be read together, such as users who read A also read B. As a business application, they help answer which items frequently appear together, which items do not tend to appear together, and why. Unlike collaborative filtering, association rules can list multiple items at a time. The challenge is that they do not take individual preferences into account, leading to generalized recommendations that may be ignored. Mining for rules also requires a dataset of transactions from all users, which is difficult because our model excludes ratings from users who have submitted fewer than 100 ratings. Only 2% of users qualify to be used in our model (likely series readers); we could not use 83% of the ratings in our dataset.

The caveat with our current selective filtering approach is that somewhat popular books rated only by readers with fewer than 100 ratings have no data and will not be recommended. GoodReads can use cloud computing, which would make it feasible to lower the filtering threshold from 100 rated books to 20. This should help diversify book recommendations. GoodReads can also consider click data. Not all recommendations need to be based on ratings; some can be interaction based, such as recommending a book based on clicks on a book profile from similar users.
