Commit 6d9b1d5

board game recommendation post
1 parent f4ddcd6 commit 6d9b1d5

2 files changed

+114
-0
lines changed

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
1+
---
2+
layout: layout
3+
title: A Board Game Recommendation Engine
4+
---
5+
6+
I decided to create a recommendation engine for board games, using data scraped
7+
from [boardgamegeek.com][bgg]. My code is available [on GitHub][gh], but be
8+
warned---I managed to get my IP blocked from BGG while collecting data. In my
9+
defense:
10+
11+
1. They have a [wiki page about data mining][dm] the site, and
12+
1. I was making 1 request per second, and downloaded less than 100MB of data.
13+
I could do that by hand.
14+
15+
My apologies to the Board Game Geek Powers That Be for my insolent behavior. I
16+
promise to never do it again. Or I'll at least do it (even) slower.
17+
18+
### The data
19+
20+
Before I was blocked, I managed to scrape the ratings lists from 1,700 users who
21+
had rated at least one game. Over 1,000 users had rated at least 10 games. There
22+
are almost 12,000 games in the database, although I only used the 1,000
23+
most-rated games for the user similarity measure (described below). All the web
24+
scraping code is in Python 3, and I also used Python to parse the XML and write
25+
the matrix of ratings to [HDF5][hd].
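
As a rough illustration of restricting the data to the most-rated games, here is a small Python sketch. It is not the code from the repository: the `(user, game, rating)` triples, the name `build_matrix`, and the use of `None` for a missing rating are all assumptions made for the example, and the HDF5 step is left out.

```python
from collections import Counter

def build_matrix(triples, top_n):
    """Return (users, games, matrix) restricted to the top_n most-rated games.

    triples: iterable of (user, game_id, rating). This input shape is an
    assumption for illustration, not the format of the scraped XML.
    """
    triples = list(triples)
    # Count how many ratings each game received.
    counts = Counter(game for _, game, _ in triples)
    # Keep the top_n most-rated games; ties are broken by game id.
    games = sorted(counts, key=lambda g: (-counts[g], g))[:top_n]
    col = {g: j for j, g in enumerate(games)}
    users = sorted({u for u, _, _ in triples})
    row = {u: i for i, u in enumerate(users)}
    # None marks "not rated", playing the role of NA in R.
    matrix = [[None] * len(games) for _ in users]
    for user, game, rating in triples:
        if game in col:
            matrix[row[user]][col[game]] = rating
    return users, games, matrix

users, games, matrix = build_matrix(
    [("ann", 21, 8.0), ("ann", 91, 9.0), ("bob", 21, 7.0), ("bob", 6, 6.5)],
    top_n=2,
)
```

Each row of `matrix` is one user's rating vector, with one column per retained game.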
26+
27+
### The engine
28+
29+
A recommendation engine needs to do two things:
30+
31+
1. Find other users who like the things you like, and
32+
1. Figure out what they really like that you've not yet rated.
33+
34+
However, the problem can be considered more simply as predicting the rating a
35+
user would give to a game they've not yet rated. Once you've made those
36+
predictions, you can simply recommend the games with the highest predicted
37+
ratings.
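
The last step, turning predictions into recommendations, might look like the following Python sketch. The function name `recommend` and the dictionary of predicted ratings are hypothetical, chosen only for illustration.

```python
def recommend(predicted, rated, n=5):
    """Return the n unrated game ids with the highest predicted ratings.

    predicted: dict mapping game id -> predicted rating (hypothetical input).
    rated: set of game ids the user has already rated.
    """
    candidates = [(g, p) for g, p in predicted.items() if g not in rated]
    # Sort by predicted rating, highest first.
    candidates.sort(key=lambda gp: gp[1], reverse=True)
    return [g for g, _ in candidates[:n]]

top = recommend(
    {21: 8.25, 91: 8.13, 492: 7.82, 80: 7.81, 5: 6.0},
    rated={21, 91},
    n=2,
)
```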
38+
39+
I followed the outline described [in this post][dp], and implemented the
40+
prediction engine in R. Each user is represented as a vector of ratings, and the
41+
similarity of two users is the cosine between their vectors. Then, a rating is
42+
predicted by taking the average of other users' ratings, weighted by their
43+
similarity with the user in question.
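
The two ideas in this paragraph, cosine similarity and a similarity-weighted average, can be sketched in a few lines of Python. This is an illustration only, not the R implementation: the function names are made up, the vectors are assumed to have no missing ratings, and leaving the target game out of the similarity calculation is my assumption about how to keep the prediction honest.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict(target, others, game):
    """Similarity-weighted average of the other users' ratings for one game.

    The similarity is computed with the target game left out (an assumption
    of this sketch), so the target user's own rating of that game plays no
    part in the prediction.
    """
    def drop(vec):
        return vec[:game] + vec[game + 1:]

    weights = [cosine(drop(target), drop(o)) for o in others]
    return sum(w * o[game] for w, o in zip(weights, others)) / sum(weights)

# Predict the third game (index 2) for a user from two other users.
p = predict([8.0, 6.0, 9.0], [[8.0, 6.0, 9.0], [6.0, 8.0, 5.0]], game=2)
```

Because ratings are positive, every weight is positive, so the prediction always lies between the lowest and highest rating the other users gave that game.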
44+
45+
The issue of missing data is not addressed in the post linked above. I wrote a
46+
modified cosine function to handle NA values. By default, the cosine function
47+
returns NA if any of the inputs are NA, but I wanted it to project onto the
48+
largest subspace with no missing data for a given pair of users, and calculate
49+
the cosine of the projected vectors.
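
A minimal Python sketch of that idea follows, with `None` standing in for R's NA. The name `cosine_shared` and the choice to return `None` when two users share no games are assumptions of the sketch, not details taken from the R code.

```python
import math

def cosine_shared(u, v):
    """Cosine over only the positions where both users have a rating.

    None marks a missing rating. Returns None when the users share no
    rated game or one of the restricted vectors is all zeros.
    """
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return None
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    if norm_u == 0 or norm_v == 0:
        return None
    return dot / (norm_u * norm_v)

# Only the first and third games are rated by both users.
sim = cosine_shared([8.0, None, 6.0, 9.0], [6.0, 7.0, 8.0, None])
```

One caveat with this kind of projection: two users who share exactly one rated game always get a cosine of 1, however different their ratings, so a minimum-overlap threshold may be worth considering.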
50+
51+
### Results
52+
53+
For testing purposes, my recommendation function predicts ratings for all games
54+
for a given user. It ignores the user's true ratings in the process. That has
55+
no effect on making recommendations, since you wouldn't recommend a game the
56+
user has already rated, but it allows honest comparisons between true and
57+
predicted ratings.
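
One way such a comparison could be summarized is with a root-mean-square error over the games the user has actually rated. The post does not settle on a measure, so treat this Python sketch, including the name `rmse` and the use of `None` for unrated games, as one hypothetical option.

```python
import math

def rmse(predicted, true):
    """Root-mean-square error over games that have a true rating.

    None marks a game the user has not rated; those positions are skipped.
    """
    errors = [(p - t) ** 2 for p, t in zip(predicted, true) if t is not None]
    if not errors:
        raise ValueError("no rated games to compare against")
    return math.sqrt(sum(errors) / len(errors))

# The third game is unrated, so only the first two contribute.
error = rmse([8.0, 7.0, 7.8], [9.0, 7.0, None])
```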
58+
59+
Here are the 20 games with the highest predicted rating for a particular user,
60+
followed by the mean and standard deviation of all of that user's true ratings.
61+
62+
63+
> r[1:20,]
64+
game predictedRating rating
65+
1 21 8.248609 8.0
66+
2 91 8.132131 9.0
67+
3 6 8.110919 8.5
68+
4 59 8.105064 8.5
69+
5 4 8.026672 9.0
70+
6 7 7.979619 9.0
71+
7 36 7.939304 9.0
72+
8 48 7.935049 9.0
73+
9 52 7.922401 8.0
74+
10 29 7.888088 9.5
75+
11 45 7.859293 9.0
76+
12 62 7.833052 9.0
77+
13 153 7.832915 9.0
78+
14 19 7.832199 7.5
79+
15 492 7.816715 NaN
80+
16 80 7.812988 NaN
81+
17 81 7.763401 9.0
82+
18 8 7.755711 9.0
83+
19 139 7.750971 8.5
84+
20 132 7.749858 NaN
85+
86+
> mean(r$rating[!is.na(r$rating)])
87+
[1] 6.914315
88+
> var(r$rating[!is.na(r$rating)])^0.5
89+
[1] 1.489091
90+
91+
The list is sorted by predicted rating, so it's satisfying to see true ratings
92+
of 7.5 or higher at the top, well above this user's mean of 6.9.
93+
94+
### Next steps
95+
96+
There are a few things I'd like to do.
97+
98+
1. Do some quantitative analysis of my results. I'm not ready to declare
99+
success as things stand currently---I need to consider what objective
100+
measures of success are appropriate.
101+
1. Do some qualitative analysis of my results: get some board game geek friends
102+
to send me a list of their favorite games, so I can make recommendations to
103+
them, and see what they think of the results.
104+
1. Use the id-to-name mappings that I stored away, so the results are a bit
105+
more interesting to look at.
106+
1. Try another method for comparison, such as a k-nearest neighbors algorithm.
107+
1. Get myself unblocked.
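
For the k-nearest neighbors idea in the list above, one simple variant would average the ratings of only the `k` most similar users instead of weighting everyone. The Python sketch below is hypothetical: the name `knn_predict`, the unweighted average, and `None` for a missing rating are all assumptions, and nothing like it exists in the repository as far as the post says.

```python
def knn_predict(similarities, ratings, k):
    """Plain average of the ratings from the k most similar users.

    similarities: one similarity score per other user.
    ratings: each other user's rating of the target game, None if unrated.
    """
    rated = [(s, r) for s, r in zip(similarities, ratings) if r is not None]
    if not rated:
        return None
    # Keep the k users with the highest similarity who rated the game.
    nearest = sorted(rated, key=lambda sr: sr[0], reverse=True)[:k]
    return sum(r for _, r in nearest) / len(nearest)

# The user with similarity 0.96 has not rated the game and is skipped.
p = knn_predict([0.99, 0.50, 0.96, 0.90], [9.0, 3.0, None, 8.0], k=2)
```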
108+
109+
[bgg]: http://www.boardgamegeek.com
110+
[gh]: https://github.com/JStech/bggrec
111+
[dm]: http://boardgamegeek.com/wiki/page/Data_Mining
112+
[hd]: http://www.hdfgroup.org/HDF5/
113+
[dp]: http://www.dataperspective.info/2014/05/basic-recommendation-engine-using-r.html

style.css

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ body {
44
color: #c8cbce;
55
width: 800px;
66
margin: 0 auto 0 auto;
7+
line-height:130%;
78
}
89

910
a:link {
