
FileIndex + New File Structure + File Prefetch + Parallel Write #8

Merged — 19 commits from fileIndex into master, Sep 3, 2024

Conversation

marino39 (Contributor):

No description provided.

@marino39 marino39 changed the title FileIndex + New File Structure FileIndex + New File Structure + File Prefetch + Parallel Write Aug 29, 2024
@marino39 marino39 marked this pull request as ready for review August 29, 2024 13:24
```go
func (r *reader[T]) prefetchNextFile(ctx context.Context) {
	if r.currFileIndex+1 < len(r.fileIndex.Files()) {
		go r.prefetchFile(ctx, r.fileIndex.At(r.currFileIndex+1))
	}
}
```
Are there limits for the number of concurrent prefetchFile tasks spawned simultaneously?

marino39 (Contributor Author) replied:
Yes, it's always only the next file that gets prefetched, so it's limited by that: no more than one prefetch task can be in flight at a time.

```go
	return r.readFile(index)
	}
}
```

```go
_ = file.Prefetch(pCtx, r.fs)
```

The error here is neither logged nor handled.

marino39 (Contributor Author) replied:

This is a fire-and-forget background task. If it fails, the file is simply opened as usual instead of from the prefetched buffer, so there is no need for logging or error handling.

Please have a look at File.Open (ethwal/common.go, lines 167 to 176 at 57ad4b3):
```go
func (f *File) Open(ctx context.Context, fs storage.FS) (io.ReadCloser, error) {
	f.mu.Lock()
	prefetchedRdr := f.prefetched()
	f.mu.Unlock()

	if prefetchedRdr != nil {
		return prefetchedRdr, nil
	}

	return f.open(ctx, fs)
}
```

@xiam commented Aug 29, 2024:

Looking great overall, I added a few comments

@marino39 (Contributor Author) replied:

> Looking great overall, I added a few comments

Thanks, I will address tomorrow.

@marino39 marino39 requested a review from xiam August 30, 2024 08:17
@xiam left a comment:

Thanks for addressing my comments! LGTM

@VojtechVitek (Contributor) left a comment:

LGTM. Nice and clean code. 🎉

A few questions / bits of minor feedback :)

Comment on lines +119 to +132
```go
// -- ethwal
// |-- 000563
// |   |-- 000256
// |   |   |-- 000124
// |   |   |   |-- 28f55a4df523b3ef28f55a4df523b3ef28f55a4df523b3ef28f55a4df523b3ef <- ethwal file
// |   |   |-- 000278
// |   |   |   |-- 28f55a4df523b3ef28f55a4df523b3ef28f55a4df523b3ef28f55a4df523b3dd <- ethwal file
// |   |-- 000025
// |   |   |-- 000967
// |   |   |   |-- 28f55a4df523b3ef28f55a4df523b3ef28f55a4df523b3dd28f55a4df523b3ef <- ethwal file
// |-- .fileIndex
//
// The data structure ensures that there are no more than 1000 directories per level. The filename is a sha-256 hash of
// the first and last block numbers. The hash is used to distribute files evenly across directories.
```

what is the main reason for creating this 3-level folder structure?

"no more than 1000 directories per level" -- is this some limitation of GCP buckets? or is it so we don't need to list dirs with paging? anything else?

marino39 (Contributor Author) replied:

The local filesystem starts to have issues when listing ~20k files; it takes a while. The problem becomes even more visible when listing the contents of a Google Storage bucket: it took a couple of minutes with approx. 40k files.

In addition to that, macOS had been hanging whenever I was working near the ethwal directory.

Peter suggested putting the files into directories, each holding a limited number of files, so they can be worked with easily.


cool, thanks for the insights :)

@marino39 marino39 merged commit c6fc296 into master Sep 3, 2024
1 check passed
@marino39 marino39 deleted the fileIndex branch September 3, 2024 08:15
marino39 added a commit that referenced this pull request Sep 10, 2024
FileIndex + New File Structure + File Prefetch + Parallel Write