Here's a rough list of items I'm considering on the path to a DataSets-1.0 release. Several of these can and should be done prior to version 1.0 in case the APIs need to be adjusted a bit before the 1.0 release.
- Streamline access for small datasets by providing a "high level" API for use when working with a fully in-memory representation of the data which doesn't require the management of separate resources. ("Separate resources" would be things like managing an on-disk cache of the data or incremental async download/upload.) Perhaps we can use the verbs `load()`/`save()` for this. Thinking of DataSets.jl as a new FileIO.jl, I think this would make sense. (Actually, this isn't breaking, so it doesn't need to wait for 1.0.)
- Somehow allow `load()` and `save()` to return some "default type the user cares about" for convenience. For example, returning a `DataFrame` for a tabular dataset. This will require addressing the problems of dynamically loading Julia modules that were partially faced in Data Layers #17.
- Consider the fate of `dataset()` and `open()`. Currently the `open(dataset(...))` idiom is a bit of an awkward double step and leads to some ambiguities. Perhaps we could repurpose `dataset(name)` to mean what `open(dataset(name))` currently does?
- Perhaps unexport `DataSet`? Users should rarely need to use this directly.
- Storage API; finalize how we're going to deal with "resources" which back a lazily downloaded dataset: cache management, etc. We could adopt the approach from ResourceContexts.jl, for example using `ctx = ResourceContext(); x = dataset(ctx, "name"); ...; close(ctx)`. Or from ContextManagers.jl in the style `ctx = dataset("name"); x = value(ctx); close(ctx)`. (Both of these have macros for syntactic shortcuts.)
- Improve and formalize the `BlobTree` API.
- Figure out how we can integrate with `FilePathsBase` and whether there's a type which can implement the `AbstractPath` interface well enough to allow things like `CSV.read(x)` to work for some `x`. Perhaps we need a `DataSpecification` type for the URI-like concept currently called "dataspec" in the codebase? We could have `CSV.read(data"foo?version=2#a/b")`?
- Consider deprecating and removing the "data entry point" stuff `@datarun` and `@datafunc`. I feel introducing these was premature and the semantics are probably not quite right. We can bring something similar back in the future if it seems like a good idea.
- Fix some issues with Data.toml
  - Consider representing the `[datasets]` section as a dictionary mapping names to configs, not as an array with `name` properties. This is safe because `TOML` syntax does allow arbitrary strings as section names. (Note that either representation is valid when a given `DataSet` is specifically tied to a project.)
  - Move data storage driver type outside of the storage section?
  - Fix up the mess with `@__DIR__` templating somehow (fixed in DataSet configuration #46)
- Dataset resolution
  - Rename `DataSets.PROJECT` to `DataSets.PROJECTS` if this is always a `StackedDataProject`.
  - Consider whether we really want a data stack vs how "data authorities" could perhaps work (i.e., the authority section in the URI; e.g., juliahub.com)
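
To make the `[datasets]` item above concrete, here is a rough sketch of the two layouts and a conversion between them, using only the `TOML` standard library. The dataset name, UUID, and storage values are made-up placeholders, and `datasets_by_name` is a hypothetical helper for illustration, not part of DataSets.jl. The dictionary layout shown is only one possible shape for the proposal.

```julia
using TOML

# Array layout: `datasets` is an array of tables, each with a `name` key.
# (All values here are illustrative placeholders.)
array_form = TOML.parse("""
data_config_version = 0

[[datasets]]
name = "a_text_file"
uuid = "00000000-0000-0000-0000-000000000001"

[datasets.storage]
driver = "FileSystem"
type = "Blob"
path = "@__DIR__/data/file.txt"
""")

# Proposed dictionary layout: the dataset name becomes the section name.
dict_form = TOML.parse("""
data_config_version = 0

[datasets.a_text_file]
uuid = "00000000-0000-0000-0000-000000000001"

[datasets.a_text_file.storage]
driver = "FileSystem"
type = "Blob"
path = "@__DIR__/data/file.txt"
""")

# Quoted keys let section names contain spaces, slashes, and so on.
quoted = TOML.parse("""
[datasets."name with spaces/and-slashes"]
description = "quoted keys allow arbitrary strings"
""")

# Hypothetical helper: turn the array layout into a name => config dictionary.
# Duplicate names are an error because a dictionary cannot represent them.
function datasets_by_name(config::AbstractDict)
    by_name = Dict{String,Any}()
    for entry in config["datasets"]
        name = entry["name"]
        haskey(by_name, name) && error("duplicate dataset name: $name")
        # Keep every key except `name`, which is now implied by the section name.
        by_name[name] = filter(kv -> first(kv) != "name", entry)
    end
    return by_name
end

converted = datasets_by_name(array_form)
```

One consequence of the dictionary layout is that name uniqueness is enforced by the file format itself, whereas the array layout needs an explicit check like the one in the helper.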