Zero copy loading from kastore #1977

Closed
jeromekelleher opened this issue Dec 2, 2021 · 7 comments
Labels: C API, Performance

Comments

jeromekelleher commented Dec 2, 2021

As discussed in #1970, the current default behaviour when calling tsk_table_collection_load is:

  • Call kastore_load() with flags=LOAD_ALL
  • For each table, call tsk_x_table_set_columns (or similar for reference sequence)

This means we have two copies of the table/reference sequence data in memory at once, which can be a problem for downstream users working with very large files.
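
Concretely, the pattern for one table looks something like this (a simplified sketch, not the actual tskit source; error handling is abbreviated):

```c
#include <kastore.h>
#include <tskit/tables.h>

/* Sketch of the current (copying) path for a single table. kastore owns the
 * buffers it returns, and set_columns copies them, so the node data is
 * briefly held twice in memory. */
static int
load_nodes_with_copy(tsk_node_table_t *nodes, kastore_t *store)
{
    int ret;
    tsk_flags_t *flags;
    double *time;
    size_t num_rows;

    ret = kastore_gets_uint32(store, "nodes/flags", &flags, &num_rows);
    if (ret != 0) {
        return ret;
    }
    ret = kastore_gets_float64(store, "nodes/time", &time, &num_rows);
    if (ret != 0) {
        return ret;
    }
    /* Copies flags and time into the table's own malloc'ed columns. */
    return tsk_node_table_set_columns(nodes, (tsk_size_t) num_rows, flags, time,
        NULL, NULL, NULL, NULL);
}
```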

We could, in principle, avoid this in the following way:

  • Add a method to kastore that returns the pointer to the malloc'ed buffer used to store the file contents when we specify LOAD_ALL (we read the whole file in at once, or nearly all of it anyway; we have to read the header first to find out how big it is).
  • Add an option to kastore_free/init telling it not to free this buffer when the kastore is freed.
  • Add an option to tsk_x_table_set_columns telling it not to copy the input data but to use the specified pointers directly. This would also need to set an internal flag so that the table never frees these pointers.
  • Similarly, the various set_string methods would need an option TSK_BORROW_REFERENCE. This would then handle the reference sequence problem.
  • We'd need to do something similar for the top-level metadata in the table collection. (See the sketch below.)
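
As a rough sketch of what the borrowing version of the earlier routine could look like (purely hypothetical: tsk_node_table_set_columns_opts, TSK_NO_COPY and KAS_KEEP_BUFFER are placeholder names for the options in the bullets, none of which exist):

```c
#include <kastore.h>
#include <tskit/tables.h>

/* Hypothetical borrowing variant of the copying sketch above. The _opts
 * variant of set_columns, TSK_NO_COPY and KAS_KEEP_BUFFER are placeholders
 * for the options described in the bullet points. */
static int
load_nodes_borrowed(tsk_node_table_t *nodes, kastore_t *store)
{
    tsk_flags_t *flags;
    double *time;
    size_t num_rows;

    /* As before, these pointers refer into kastore's single malloc'ed file
     * buffer (error handling elided). */
    kastore_gets_uint32(store, "nodes/flags", &flags, &num_rows);
    kastore_gets_float64(store, "nodes/time", &time, &num_rows);

    /* Placeholder option: keep the pointers rather than copying them, and set
     * an internal "borrowed" flag so tsk_node_table_free never frees them. */
    return tsk_node_table_set_columns_opts(nodes, (tsk_size_t) num_rows, flags,
        time, NULL, NULL, NULL, NULL, TSK_NO_COPY);
}

/* The caller would then close the kastore with a placeholder KAS_KEEP_BUFFER
 * option so the file buffer, and hence the borrowed columns, stay alive for
 * the lifetime of the table collection. */
```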

This is all straightforward enough, but the difficulty arises when someone tries to update this table collection. The simplest thing would be to say that it's strictly read-only. However, that means anyone wanting to update these large table collections would have to take a copy first, and then we'd be back to square one in terms of the memory high-water mark.

So, to do this, we'd need to implement a C-level state machine enforcing read-only behaviour on tables and table collections (and reference sequences). Cf. #1906, #993, #1917.
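
In outline, the enforcement could be a state flag checked at the top of every mutating method, something like this (hypothetical; neither the flag nor the error code exists in the current C API):

```c
#include <stdbool.h>

/* Hypothetical sketch: tables that borrow their columns would carry a
 * read_only flag, and every mutating method (add_row, set_columns, truncate,
 * ...) would start with a guard like this. */
#define TSK_ERR_TABLE_READ_ONLY (-9001) /* placeholder error code */

typedef struct {
    bool read_only; /* set when the columns are borrowed, not owned */
    /* ... the existing table members ... */
} borrowed_table_t; /* placeholder standing in for the tsk_x_table_t structs */

static int
check_writable(const borrowed_table_t *table)
{
    return table->read_only ? TSK_ERR_TABLE_READ_ONLY : 0;
}
```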

@bhaller - what do you think? Would this work for your use-case, or would you need to be able to subsequently write to the tables that you read in?

jeromekelleher added the C API and Performance labels on Dec 2, 2021

bhaller commented Dec 2, 2021

I don't think SLiM would use this facility, no; we need to be able to modify the tables. As far as I can tell, the copy from kastore is unavoidable in general. In #1970 I'm specifically concerned with avoiding the copy of the reference sequence, because in that specific case (a) the reference sequence can be extremely large (the tables could be too, I suppose, but a 1 GB – 10 GB reference sequence is commonplace), (b) the copy can be avoided because SLiM doesn't want or need tskit to keep the reference sequence itself, and (c) SLiM's own copy of the reference sequence can be made directly from kastore, at 2 bits per base, which avoids doubling the memory overhead.
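
For reference, the kind of 2-bit packing I mean is roughly this (an illustrative sketch, not SLiM's actual code; it assumes a plain A/C/G/T alphabet with no ambiguity codes):

```c
#include <stdint.h>
#include <stdlib.h>

/* Pack an A/C/G/T string into 2 bits per base, 4 bases per byte.
 * Returns a malloc'ed buffer of (len + 3) / 4 bytes, or NULL on failure. */
static uint8_t *
pack_2bit(const char *bases, size_t len)
{
    uint8_t *packed = calloc((len + 3) / 4, 1);
    size_t i;
    uint8_t code;

    if (packed == NULL) {
        return NULL;
    }
    for (i = 0; i < len; i++) {
        switch (bases[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            default:  code = 3; break; /* 'T' (and anything unexpected) */
        }
        packed[i / 4] |= (uint8_t) (code << (2 * (i % 4)));
    }
    return packed;
}
```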

So I don't think the idea of making a read-only table collection would be a useful solution to this problem, for SLiM at least. I'd actually suggest closing this issue and re-opening #1970; that issue contains a lot of discussion that describes the specific situation with respect to the reference sequence in SLiM, and proposes a solution that I do think is workable. Let's go back to that?

@jeromekelleher

Thanks @bhaller. I can see how this wouldn't be that useful for SLiM, but it'll be crucial for bringing the memory overhead down in the common "read a tree sequence and analyse it" case.

But I'd forgotten about the "assign this reference sequence into the tables without another copy" case discussed in #1979, which we also want to deal with (and are doing so in #1990).

bhaller commented Dec 3, 2021

OK, thanks @jeromekelleher. I thought you were proposing this issue solely as a fix for SLiM to address #1970; glad to hear otherwise, no worries. Yes, the cluster of issues all in the air simultaneously is a bit intense right now! :-> OK, I'll tune out of this one since it's not relevant to SLiM. :->

@jeromekelleher

Heating up for 1.0! Hopefully things will go nice and quiet soon...

@jeromekelleher

After discussing with @benjeffery just now, we've come up with a solution that is much simpler, and doesn't require setting anything into read-only mode. Summarised in #2006.

I guess we can close this issue once we have zero-copy loading from kastore into tables.

This requires kastore version 2.1.0 (https://github.com/tskit-dev/kastore/milestone/9)

@jeromekelleher

Just a note to follow up here in light of #2016 and recent developments. Since all the x_table_set_columns methods take const arguments, it's not possible to take ownership of a buffer by that route. So, we'll use the new x_table_takeset_columns methods instead.

An advantage of using these new methods is that we have an opportunity to get the default semantics for columns right, so that, for example, any missing ragged column pairs can be malloced with the appropriate empty buffers.
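
Roughly, the ownership difference looks like this (a sketch; the takeset signature shown is an assumption, i.e. set_columns with the const qualifiers dropped):

```c
#include <tskit/tables.h>

/* Sketch of the ownership difference between set_columns and takeset_columns.
 * The takeset signature used here is assumed, not taken from the docs. */
static int
give_columns_to_table(tsk_node_table_t *nodes, tsk_size_t num_rows,
    tsk_flags_t *flags, double *time)
{
    /* tsk_node_table_set_columns(nodes, num_rows, flags, time, ...) would
     * copy: the caller keeps ownership of flags/time and must free them. */

    /* takeset stores the pointers directly and frees them from
     * tsk_node_table_free, so the caller must not free or reuse them. */
    return tsk_node_table_takeset_columns(nodes, num_rows, flags, time,
        NULL, NULL, NULL, NULL);
}
```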

@jeromekelleher

I think we can close this now, @benjeffery?
