Thoughts on getting many outputs #96

@mslw

Suppose a compute specification declares and produces multiple outputs. How would get behave when more than one of these output files is requested?

AFAIK git annex get acts sequentially, i.e. it retrieves files one by one. This means that the remake special remote would be called as many times as there are files to get. Each time it is called, it would go through the steps of provision, execute, and collect (each time producing all declared outputs but collecting only one). The specific protocol request is TRANSFER RETRIEVE.

For example: if a specification has two outputs, and we want to get both of them, we end up provisioning the inputs twice, and running the computation twice.
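To make the cost concrete, here is a toy model of the behaviour described above. It is not the remake code: provision, execute, and transfer_retrieve are hypothetical stand-ins that only count how often each step runs when every output is requested separately.

```python
from collections import Counter

# Counts how often each (hypothetical) step is invoked.
calls = Counter()


def provision(inputs):
    """Stand-in for provisioning inputs into a fresh worktree."""
    calls["provision"] += 1
    return dict(inputs)


def execute(worktree, outputs):
    """Stand-in for the computation: always produces ALL declared outputs."""
    calls["execute"] += 1
    return {name: f"content of {name}" for name in outputs}


def transfer_retrieve(requested, inputs, outputs):
    """One retrieval request: provision, execute, collect a single output."""
    worktree = provision(inputs)
    produced = execute(worktree, outputs)
    # Only the requested output is collected; the others are discarded.
    return produced[requested]


declared_outputs = ["a.txt", "b.txt"]
# Sequential retrieval: one request per file.
retrieved = [
    transfer_retrieve(name, {"in.txt": "data"}, declared_outputs)
    for name in declared_outputs
]
print(dict(calls))
```

With two declared outputs, both counters end at 2: the full provision and execute cycle runs once per requested file.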

This is inefficient, but it is probably a necessary compromise: we cannot hook deep into the logic of get. Trading off storage space for compute time is inherent in compute-on-demand. If computations are quick and provisioning can use local files (i.e. avoid slow re-downloads), this should not cause any noticeable discomfort. Nevertheless, this is a limitation.

Could that be improved upon? Probably not without different tradeoffs.

I re-read the external special remote protocol description and I don't think there is a better place for provisioning & execution than the currently used TRANSFER_RETRIEVE.

My one thought was that the outputs could be cached. Instead of dropping the entire secondary worktree with its contents after collecting the requested file (key), we could store the other outputs (if multiple outputs are declared) in a temporary location, either within the dataset or in the temporary directory. Any subsequent TRANSFER_RETRIEVE could then check whether the file is available from that cache before starting to provision and execute. One question is how to index the cache. Another (bigger) question is how and when to clear it. As a special remote, remake is pretty much transparent to the user, but an explicit cache cleaning command seems the best option, especially if the cache is kept somewhere within the dataset directory. See also: #13
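A minimal sketch of how such a cache might look, under several assumptions: the cache is indexed by the annex key, keys are safe to use as file names, and the computation is passed in as a callable returning a mapping from key to produced file. OutputCache, retrieve, and compute are hypothetical names and not part of remake.

```python
import shutil
import tempfile
from pathlib import Path


class OutputCache:
    """Hypothetical cache of not-yet-requested outputs, indexed by key."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def store(self, key, source):
        # Assumption: the key is usable as a file name.
        shutil.copyfile(source, self.root / key)

    def fetch(self, key, destination):
        """Move a cached output to destination; return False on a miss."""
        cached = self.root / key
        if not cached.is_file():
            return False
        shutil.move(str(cached), str(destination))
        return True

    def clear(self):
        """Explicit cleanup, e.g. triggered by a dedicated command."""
        shutil.rmtree(self.root, ignore_errors=True)


def retrieve(key, destination, cache, compute):
    """Serve a retrieval from the cache, falling back to computation."""
    if cache.fetch(key, destination):
        return "cache"
    produced = compute()  # assumed to return {key: Path} for all outputs
    for other_key, path in produced.items():
        if other_key == key:
            shutil.copyfile(path, destination)
        else:
            cache.store(other_key, path)  # keep siblings for later requests
    return "computed"
```

In this sketch a cache hit removes the entry, so the cache only ever holds outputs that were produced but not yet requested. That does not settle when to clear outputs that are never requested, which is why clear() is left as an explicit operation.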
