Zero copy loading from kastore #1977

Closed
jeromekelleher opened this issue Dec 2, 2021 · 7 comments
Labels: C API, Performance

Comments

jeromekelleher commented Dec 2, 2021

As discussed in #1970, the current default behaviour when calling tsk_table_collection_load is:

  • Call kastore_load() with flags=LOAD_ALL
  • For each table, call tsk_x_table_set_columns (or similar for reference sequence)

This means we have two copies of the table/reference sequence data in memory at once, which can be a problem for downstream users working with very large files.
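
Concretely, the pattern for one table looks something like this (a simplified sketch, not the actual tskit source; error handling is abbreviated):

```c
#include <kastore.h>
#include <tskit/tables.h>

/* Sketch of the current (copying) path for a single table. kastore owns the
 * buffers it returns, and set_columns copies them, so the node data is
 * briefly held twice in memory. */
static int
load_nodes_with_copy(tsk_node_table_t *nodes, kastore_t *store)
{
    int ret;
    tsk_flags_t *flags;
    double *time;
    size_t num_rows;

    ret = kastore_gets_uint32(store, "nodes/flags", &flags, &num_rows);
    if (ret != 0) {
        return ret;
    }
    ret = kastore_gets_float64(store, "nodes/time", &time, &num_rows);
    if (ret != 0) {
        return ret;
    }
    /* Copies flags and time into the table's own malloc'ed columns. */
    return tsk_node_table_set_columns(nodes, (tsk_size_t) num_rows, flags, time,
        NULL, NULL, NULL, NULL);
}
```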

We could, in principle, avoid this in the following way:

  • Add a method to kastore that returns the pointer to the malloc'ed buffer used to store the file contents when we specify LOAD_ALL (we read the whole file in at once, or nearly all of it anyway; we have to read the header first to find out how big it is).
  • Add an option to kastore_free/init telling it not to free this buffer when the kastore is freed.
  • Add an option to tsk_x_table_set_columns telling it not to copy the input data but to use the specified pointers directly. This would also need to set an internal flag so that the table never frees these pointers.
  • Similarly, the various set_string methods would need an option TSK_BORROW_REFERENCE. This would then handle the reference sequence problem.
  • We'd need to do something similar for the top-level metadata in the table collection. (See the sketch below.)
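
As a rough sketch of what the borrowing version of the earlier routine could look like (purely hypothetical: tsk_node_table_set_columns_opts, TSK_NO_COPY and KAS_KEEP_BUFFER are placeholder names for the options in the bullets, none of which exist):

```c
#include <kastore.h>
#include <tskit/tables.h>

/* Hypothetical borrowing variant of the copying sketch above. The _opts
 * variant of set_columns, TSK_NO_COPY and KAS_KEEP_BUFFER are placeholders
 * for the options described in the bullet points. */
static int
load_nodes_borrowed(tsk_node_table_t *nodes, kastore_t *store)
{
    tsk_flags_t *flags;
    double *time;
    size_t num_rows;

    /* As before, these pointers refer into kastore's single malloc'ed file
     * buffer (error handling elided). */
    kastore_gets_uint32(store, "nodes/flags", &flags, &num_rows);
    kastore_gets_float64(store, "nodes/time", &time, &num_rows);

    /* Placeholder option: keep the pointers rather than copying them, and set
     * an internal "borrowed" flag so tsk_node_table_free never frees them. */
    return tsk_node_table_set_columns_opts(nodes, (tsk_size_t) num_rows, flags,
        time, NULL, NULL, NULL, NULL, TSK_NO_COPY);
}

/* The caller would then close the kastore with a placeholder KAS_KEEP_BUFFER
 * option so the file buffer, and hence the borrowed columns, stay alive for
 * the lifetime of the table collection. */
```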

This is all straightforward enough, but the difficulty arises when someone tries to update this table collection. The simplest thing would be to say that it's strictly read-only. However, that means anyone wanting to update these large table collections would have to take a copy first, and then we'd be back to square one in terms of the memory high-water mark.

So, to do this, we'd need to implement a C-level state machine enforcing read-only behaviour on tables and table collections (and reference sequences). Cf. #1906, #993, #1917.
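
In outline, the enforcement could be a state flag checked at the top of every mutating method, something like this (hypothetical; neither the flag nor the error code exists in the current C API):

```c
#include <stdbool.h>

/* Hypothetical sketch: tables that borrow their columns would carry a
 * read_only flag, and every mutating method (add_row, set_columns, truncate,
 * ...) would start with a guard like this. */
#define TSK_ERR_TABLE_READ_ONLY (-9001) /* placeholder error code */

typedef struct {
    bool read_only; /* set when the columns are borrowed, not owned */
    /* ... the existing table members ... */
} borrowed_table_t; /* placeholder standing in for the tsk_x_table_t structs */

static int
check_writable(const borrowed_table_t *table)
{
    return table->read_only ? TSK_ERR_TABLE_READ_ONLY : 0;
}
```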

@bhaller - what do you think? Would this work for your use-case, or would you need to be able to subsequently write to the tables that you read in?

jeromekelleher added the C API and Performance labels on Dec 2, 2021

bhaller commented Dec 2, 2021

I don't think SLiM would use this facility, no; we need to be able to modify the tables. As far as I can tell, the copy from kastore is unavoidable in general. In #1970 I'm specifically concerned with avoiding the copy of the reference sequence, because in that specific case (a) the reference sequence can be extremely large (the tables could be too, I suppose, but a 1 GB – 10 GB reference sequence is commonplace), (b) the copy can be avoided because SLiM doesn't want or need tskit to keep the reference sequence itself, and (c) SLiM's own copy of the reference sequence can be made directly from kastore, at 2 bits per base, which avoids doubling the memory overhead.
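
For reference, the kind of 2-bit packing I mean is roughly this (an illustrative sketch, not SLiM's actual code; it assumes a plain A/C/G/T alphabet with no ambiguity codes):

```c
#include <stdint.h>
#include <stdlib.h>

/* Pack an A/C/G/T string into 2 bits per base, 4 bases per byte.
 * Returns a malloc'ed buffer of (len + 3) / 4 bytes, or NULL on failure. */
static uint8_t *
pack_2bit(const char *bases, size_t len)
{
    uint8_t *packed = calloc((len + 3) / 4, 1);
    size_t i;
    uint8_t code;

    if (packed == NULL) {
        return NULL;
    }
    for (i = 0; i < len; i++) {
        switch (bases[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            default:  code = 3; break; /* 'T' (and anything unexpected) */
        }
        packed[i / 4] |= (uint8_t) (code << (2 * (i % 4)));
    }
    return packed;
}
```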

So I don't think the idea of making a read-only table collection would be a useful solution to this problem, for SLiM at least. I'd actually suggest closing this issue and re-opening #1970; that issue contains a lot of discussion that describes the specific situation with respect to the reference sequence in SLiM, and proposes a solution that I do think is workable. Let's go back to that?

@jeromekelleher

Thanks @bhaller. I can see how this wouldn't be that useful for SLiM, but it'll be crucial for bringing the memory overhead down in the common "read a tree sequence and analyse it" case.

But I'd forgotten about the "assign this reference sequence into the tables without another copy" case discussed in #1979, which we also want to deal with (and are doing so in #1990).

bhaller commented Dec 3, 2021

OK, thanks @jeromekelleher. I thought you were proposing this issue solely as a fix for SLiM to address #1970; glad to hear otherwise, no worries. Yes, the cluster of issues all in the air simultaneously is a bit intense right now! :-> OK, I'll tune out of this one since it's not relevant to SLiM. :->

@jeromekelleher

Heating up for 1.0! Hopefully things will go nice and quiet soon...

@jeromekelleher

After discussing with @benjeffery just now, we've come up with a solution that is much simpler, and doesn't require setting anything into read-only mode. Summarised in #2006.

I guess we can close this issue once we have zero-copy loading from kastore into tables.

This requires kastore version 2.1.0 (https://github.com/tskit-dev/kastore/milestone/9)

@jeromekelleher

Just a note to follow up here in light of #2016 and recent developments. Since all the x_table_set_columns methods take const arguments, it's not possible to take ownership of a buffer by that route. So, we'll use the new x_table_takeset_columns methods instead.

An advantage of using these new methods is that we have an opportunity to get the default semantics for columns right, so that, for example, any missing ragged column pairs can be malloced with the appropriate empty buffers.
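
Roughly, the ownership difference looks like this (a sketch; the takeset signature shown is an assumption, i.e. set_columns with the const qualifiers dropped):

```c
#include <tskit/tables.h>

/* Sketch of the ownership difference between set_columns and takeset_columns.
 * The takeset signature used here is assumed, not taken from the docs. */
static int
give_columns_to_table(tsk_node_table_t *nodes, tsk_size_t num_rows,
    tsk_flags_t *flags, double *time)
{
    /* tsk_node_table_set_columns(nodes, num_rows, flags, time, ...) would
     * copy: the caller keeps ownership of flags/time and must free them. */

    /* takeset stores the pointers directly and frees them from
     * tsk_node_table_free, so the caller must not free or reuse them. */
    return tsk_node_table_takeset_columns(nodes, num_rows, flags, time,
        NULL, NULL, NULL, NULL);
}
```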

@jeromekelleher

I think we can close this now, @benjeffery?
