Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache godoc package data #70

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

vmpjdc
Copy link
Contributor

@vmpjdc vmpjdc commented Oct 23, 2019

Now that gopkg.in is hosted in the UK, godoc requests made while rendering the human-readable package pages take a lot longer (often a second or more). This branch adds a cache with a TTL of 1 hour.

Also, mainly as an exercise, I've removed dataMutex and replaced gotResp with a pair of string channels.

Copy link

@4a6f656c 4a6f656c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a couple of comments.

page.go Outdated
}()

wantResps := 2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do this below as part of the for loop:

for r, wantResps := 0, 2; r < wantResps; {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented Gustavo's multiple-select suggestion instead.

page.go Outdated
@@ -217,6 +217,43 @@ type packageData struct {
GitTreeName string
}

type cacheEntry struct {
t time.Time

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be inclined to call this timestamp or ts.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just have Timestamp as part of packageData itself and get rid of this type altogether.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone with Timestamp here.

Copy link
Owner

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the description of the change. It says it's caching package data for rendering pages, but that's already cached. What's the problem being solved in more detail?

Update: Sorry, I'm crazy. I mixed up the work I've done on the documentation proxy, which does have a nice cache, with this. Thanks for adding the cache. I'll send a review next.

@niemeyer
Copy link
Owner

Sorry, I'm crazy. I mixed up the work I've done on the documentation proxy, which does have a nice cache, with this. Thanks for adding the cache. I'll send a review next.

Copy link
Owner

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea and implementation look good. Here are a few nitpicks on the details:

page.go Outdated
r := 0
for r < wantResps {
select {
case <-gotResp:
case data.PackageName = <-name:
r++
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have independent channels in the middle of the logic above, it's more likely we might get a bug some day about two responses being received. Instead of counting responses, let's just have something simpler and more resilient:

timeout := time.After(3 * time.Second)
select {
case data.PackageName := <-name:
case <-timeout:
}
select {
case data.Synopsis := <-synopsis:
case <-timeout:
}

Please double check, but I think this achieves the same with less logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented, and it looks good in my testing.

page.go Outdated
@@ -227,7 +264,7 @@ type SearchResults struct {

var regexpPackageName = regexp.MustCompile(`<h2 id="pkg-overview">package ([\p{L}_][\p{L}\p{Nd}_]*)</h2>`)

func renderPackagePage(resp http.ResponseWriter, req *http.Request, repo *Repo) {
func fetchPackageData(repo *Repo) *packageData {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a nice change by itself.

page.go Outdated
@@ -217,6 +217,43 @@ type packageData struct {
GitTreeName string
}

type cacheEntry struct {
t time.Time
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just have Timestamp as part of packageData itself and get rid of this type altogether.

page.go Outdated
}

var cache map[string]*cacheEntry = make(map[string]*cacheEntry)
var cacheLock sync.RWMutex
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are globals, so we need a more clear name for them.

Perhaps, using your theme:

  • pageDataCache
  • pageDataCacheTTL
  • pageDataCacheLock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following my changes to stop caching repo, I've renamed this packageDataCache, etc.

page.go Outdated

const cacheTTL = 1 * time.Hour

func cacheGet(name string) *packageData {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.. we can also reverse notation so it reads like the other methods:

  • getPageData
  • setPageData
    We might also optionally rename packageData to pageData, so it's all shorter and more aligned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've stuck with packageData, since it now only contains Name and Synopsis, but renamed the functions verb-first.

page.go Outdated
var cache map[string]*cacheEntry = make(map[string]*cacheEntry)
var cacheLock sync.RWMutex

const cacheTTL = 1 * time.Hour
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will have a usability impact, in that people will push their git repository and refresh the page expecting a change. Can we start on the conservative side, caching with, say, 1 minute, and then grow from there as we learn more about usage patterns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken a look at the access logs and based on my very crude hourly histograms it looks like even the most popular package page (yaml.v2) is only visited once every five or so minutes on average, so we may not want to bother with this at all.

However, if we do: I just noticed that repo is passed in fresh every time (it's fetched by handler() in main.go) so caching it at all was pretty silly of me. We can have the best of both worlds by caching only the godoc data (name and synopsis) and using repo to calculate the versions, which should always be current. (We can always add a repoCache later with a 1-minute TTL later if we decide we'd like to speed this up further.)

I've implemented this and will push shortly.

page.go Outdated
if time.Since(entry.t) < cacheTTL {
return entry.pd
}
log.Println("ignoring expired cache entry for", name)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit boring to read on the logs so frequently. It feels more like a debug message than a log one. Most accesses are not for the pages but via git, so when we get the page it'll always come with one of these and ...

page.go Outdated
if time.Since(entry.t) < cacheTTL {
return
}
log.Println("updating expired cache entry for", name)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... one of those.

I suggest dropping both for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed both.

@vmpjdc vmpjdc force-pushed the package-data-cache branch 2 times, most recently from bc7ba44 to 2487c2a Compare October 24, 2019 00:22
@vmpjdc vmpjdc force-pushed the package-data-cache branch from 2487c2a to cd92e43 Compare October 24, 2019 01:22
Copy link
Owner

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, except for the TTL:

page.go Outdated
var packageDataCache map[string]*packageData = make(map[string]*packageData)
var packageDataCacheLock sync.RWMutex

const packageDataCacheTTL = 1 * time.Hour
Copy link
Owner

@niemeyer niemeyer Oct 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, can we please drop this to a much smaller time window? I suggest 1 minute at most. The issue here is usability: people pushing their own PRs will refresh this page to see it working, and it won't. If this is barely used at all, then we don't even need caching. Maybe let's drop that idea altogether then?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, with that same rationale, let's do 1 minute please. This matches the other PR, and is enough to catch fast refresh action without actually making it too awkward for people finding out if their branch logic is being considered by the service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current version of this branch now only caches the GoDoc data, so the versions should always be fresh. It looks like GoDoc refreshes once a day (https://godoc.org/-/about) although if someone invokes a manual refresh they may notice that it no longer updates immediately.

I'll make it 1 minute and we can go from there. I'm interested in exporting metrics from the gopkg server, and cache stats would certainly be among them (actually, so far that's all I've thought of...).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants