Proposal: String representation/RleMatrix #38
Comments
Ideally we would use a lightweight representation (like the proposed RleListMatrix) that is more computable than a delimited string.
I'm not sure whether or not the |
I think it would in this case, because of how the Rle compression helps the character case.
Does it work in that context because a matrix is just a vector with a dim, so you can actually compress long runs of sequential elements? I'm not sure I see how that manifests for the 'vector of lists' case, but perhaps something clever can be done in the unlisting/relisting sense to take advantage of that.
It will be efficient as long as the RleList is a CompressedList, meaning that internally there is just a flat vector that is lazily partitioned into list elements.
In that case it may even be slightly more performant than the string version, as there would be even fewer unique values over which a run may compress. Nice!
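The flat-vector idea described in the comments above can be sketched in base R. This is only an illustration under stated assumptions: it uses base `rle()` as a stand-in for the Bioconductor `Rle` class, the variable names are hypothetical, and missing values are written as `"."` because base `rle()` treats each `NA` as its own run.

```r
# Ten list cells of length 2, mostly missing (written "." as in VCF text).
cells <- c(rep(list(c(".", ".")), 5),
           list(c("3", "7")),
           rep(list(c(".", ".")), 4))

flat   <- unlist(cells)    # one flat vector holding every value
widths <- lengths(cells)   # the partitioning: number of values per cell

# Runs are computed on the flat vector, so they span cell boundaries.
flat_rle <- rle(flat)
n_runs <- length(flat_rle$lengths)

# Relist: expand the runs and cut the flat vector back into cells.
groups   <- rep(seq_along(widths), widths)
restored <- unname(split(inverse.rle(flat_rle), groups))

stopifnot(identical(restored, cells))
```

Because the runs are taken over individual values rather than over whole `"x,y"` strings, a single run can cover many consecutive cells, which is the effect the last two comments describe.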
I'd like to propose a couple of potential optimisations, particularly useful for cases where many entries are the same (e.g. AD: NA NA for GT: "./."). I'm working with a file which compresses nicely to .rds (~40MB) but is very large when in memory (17GB). This size becomes difficult to work with on a local machine or interactively.
Looping over the individual assays in a `CollapsedVcf`, I see that the largest ones are those which are matrices of lists (e.g. AD, AF, MBQ, ...). In theory (I think) they contain the same scale of information as GT, but their representation makes them much larger.

I appreciate that this list format makes working with the data cleaner than storing the entries as delimited strings (as I believe they natively are in the VCF), which would likely require parsing each string and splitting it into a list-like structure anyway. But this is a tradeoff between size and usability, and it applies to the entire dataset even if a user is only interested in a subset.
Would there be interest in a representation of `CollapsedVcf` list-matrices as either delimited strings (in which case they could be stored as `RleMatrix` for an additional compression boost) or `RleListMatrix` (which doesn't exist, but here's an open issue: Bioconductor/DelayedArray#62)? I'm not well-versed enough in the `Rle` side to know if `Rle` provides a benefit when stored in a list, but it's an option.

Below is a comparison of object sizes for a toy example 'assay' constructed as a matrix of list elements. The size savings are potentially very large (in this case 1/62 the size).
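The original reprex code did not survive on this page. As a rough stand-in, the following base R sketch builds a comparable toy assay and compares the three representations. It is not the author's code and is not expected to reproduce the 1/62 figure: the dimensions, the number of called cells, and all variable names are assumptions, and base `rle()` is used in place of `RleMatrix`.

```r
set.seed(1)
n_row   <- 1000L   # hypothetical number of variants
n_col   <- 50L     # hypothetical number of samples
n_cells <- n_row * n_col

# List-matrix: every cell is a length-2 integer vector, mostly c(NA, NA).
cells  <- rep(list(c(NA_integer_, NA_integer_)), n_cells)
called <- sample(n_cells, 50L)   # a few cells with real values
cells[called] <- lapply(called, function(i) sample(0:30, 2L))
list_mat <- matrix(cells, nrow = n_row, ncol = n_col)

# Delimited-string matrix: each cell collapsed to e.g. "NA,NA" or "12,3".
str_vec <- vapply(cells, paste, character(1), collapse = ",")
str_mat <- matrix(str_vec, nrow = n_row, ncol = n_col)

# Run-length encoding of the strings in column-major order.
str_rle <- rle(str_vec)

# object.size() counts each cell's vector separately.
sizes <- c(list_matrix   = as.numeric(object.size(list_mat)),
           string_matrix = as.numeric(object.size(str_mat)),
           string_rle    = as.numeric(object.size(str_rle)))
print(sizes)

# The encoding is lossless: expanding the runs gives back the string matrix.
restored <- matrix(inverse.rle(str_rle), nrow = n_row, ncol = n_col)
stopifnot(identical(restored, str_mat))
```

The ratio between the sizes depends heavily on how many cells share the same value, so sparse assays such as AD with many `"./."` genotypes are the favourable case.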
Created on 2020-02-28 by the reprex package (v0.3.0)
I'm not across this package enough to have any insights into implementation issues or how this might affect other aspects of the inner workings (or user-side workings), but this approach reduces a 1.71GB `CollapsedVcf` into a 20MB object of the same class without destroying any information. Credit to @lawremi for guidance towards optimising the conversion and for structural advice.