-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cache godoc package data #70
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with a couple of comments.
page.go
Outdated
}() | ||
|
||
wantResps := 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do this below as part of the for loop:
for r, wantResps := 0, 2; r < wantResps; {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've implemented Gustavo's multiple-select suggestion instead.
page.go
Outdated
@@ -217,6 +217,43 @@ type packageData struct { | |||
GitTreeName string | |||
} | |||
|
|||
type cacheEntry struct { | |||
t time.Time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be inclined to call this timestamp or ts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just have Timestamp as part of packageData itself and get rid of this type altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone with Timestamp here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the description of the change. It says it's caching package data for rendering pages, but that's already cached. What's the problem being solved in more detail?
Update: Sorry, I'm crazy. I mixed up the work I've done on the documentation proxy, which does have a nice cache, with this. Thanks for adding the cache. I'll send a review next.
Sorry, I'm crazy. I mixed up the work I've done on the documentation proxy, which does have a nice cache, with this. Thanks for adding the cache. I'll send a review next. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea and implementation look good. Here are a few nitpicks on the details:
page.go
Outdated
r := 0 | ||
for r < wantResps { | ||
select { | ||
case <-gotResp: | ||
case data.PackageName = <-name: | ||
r++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have independent channels in the middle of the logic above, it's more likely we might get a bug some day about two responses being received. Instead of counting responses, let's just have something simpler and more resilient:
timeout := time.After(3 * time.Second)
select {
case data.PackageName := <-name:
case <-timeout:
}
select {
case data.Synopsis := <-synopsis:
case <-timeout:
}
Please double check, but I think this achieves the same with less logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented, and it looks good in my testing.
page.go
Outdated
@@ -227,7 +264,7 @@ type SearchResults struct { | |||
|
|||
var regexpPackageName = regexp.MustCompile(`<h2 id="pkg-overview">package ([\p{L}_][\p{L}\p{Nd}_]*)</h2>`) | |||
|
|||
func renderPackagePage(resp http.ResponseWriter, req *http.Request, repo *Repo) { | |||
func fetchPackageData(repo *Repo) *packageData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a nice change by itself.
page.go
Outdated
@@ -217,6 +217,43 @@ type packageData struct { | |||
GitTreeName string | |||
} | |||
|
|||
type cacheEntry struct { | |||
t time.Time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just have Timestamp as part of packageData itself and get rid of this type altogether.
page.go
Outdated
} | ||
|
||
var cache map[string]*cacheEntry = make(map[string]*cacheEntry) | ||
var cacheLock sync.RWMutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are globals, so we need a more clear name for them.
Perhaps, using your theme:
- pageDataCache
- pageDataCacheTTL
- pageDataCacheLock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following my changes to stop caching repo, I've renamed this packageDataCache, etc.
page.go
Outdated
|
||
const cacheTTL = 1 * time.Hour | ||
|
||
func cacheGet(name string) *packageData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.. we can also reverse notation so it reads like the other methods:
- getPageData
- setPageData
We might also optionally rename packageData to pageData, so it's all shorter and more aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've stuck with packageData, since it now only contains Name and Synopsis, but renamed the functions verb-first.
page.go
Outdated
var cache map[string]*cacheEntry = make(map[string]*cacheEntry) | ||
var cacheLock sync.RWMutex | ||
|
||
const cacheTTL = 1 * time.Hour |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will have a usability impact, in that people will push their git repository and refresh the page expecting a change. Can we start on the conservative side, caching with, say, 1 minute, and then grow from there as we learn more about usage patterns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken a look at the access logs and based on my very crude hourly histograms it looks like even the most popular package page (yaml.v2) is only visited once every five or so minutes on average, so we may not want to bother with this at all.
However, if we do: I just noticed that repo is passed in fresh every time (it's fetched by handler() in main.go) so caching it at all was pretty silly of me. We can have the best of both worlds by caching only the godoc data (name and synopsis) and using repo to calculate the versions, which should always be current. (We can always add a repoCache later with a 1-minute TTL later if we decide we'd like to speed this up further.)
I've implemented this and will push shortly.
page.go
Outdated
if time.Since(entry.t) < cacheTTL { | ||
return entry.pd | ||
} | ||
log.Println("ignoring expired cache entry for", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit boring to read on the logs so frequently. It feels more like a debug message than a log one. Most accesses are not for the pages but via git, so when we get the page it'll always come with one of these and ...
page.go
Outdated
if time.Since(entry.t) < cacheTTL { | ||
return | ||
} | ||
log.Println("updating expired cache entry for", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... one of those.
I suggest dropping both for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed both.
bc7ba44
to
2487c2a
Compare
Use the fresh GitHub data to calculate versions so that pushes are seen immediately.
2487c2a
to
cd92e43
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, except for the TTL:
page.go
Outdated
var packageDataCache map[string]*packageData = make(map[string]*packageData) | ||
var packageDataCacheLock sync.RWMutex | ||
|
||
const packageDataCacheTTL = 1 * time.Hour |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, can we please drop this to a much smaller time window? I suggest 1 minute at most. The issue here is usability: people pushing their own PRs will refresh this page to see it working, and it won't. If this is barely used at all, then we don't even need caching. Maybe let's drop that idea altogether then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, with that same rationale, let's do 1 minute please. This matches the other PR, and is enough to catch fast refresh action without actually making it too awkward for people finding out if their branch logic is being considered by the service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current version of this branch now only caches the GoDoc data, so the versions should always be fresh. It looks like GoDoc refreshes once a day (https://godoc.org/-/about) although if someone invokes a manual refresh they may notice that it no longer updates immediately.
I'll make it 1 minute and we can go from there. I'm interested in exporting metrics from the gopkg server, and cache stats would certainly be among them (actually, so far that's all I've thought of...).
Now that gopkg.in is hosted in the UK, godoc requests made while rendering the human-readable package pages take a lot longer (often a second or more). This branch adds a cache with a TTL of 1 hour.
Also, mainly as an exercise, I've removed dataMutex and replaced gotResp with a pair of string channels.