Suppose a compute specification declares and produces multiple outputs. How would `get` behave when more than one of these output files is requested?
AFAIK `git annex get` acts sequentially, i.e. retrieves files one by one. This means that the remake special remote would be called upon as many times as there are files to get. Each time it is called, it would go through the steps of provision - execute - collect (each time producing multiple outputs but collecting only one). The specific call is `TRANSFER_RETRIEVE`.
For example: if a specification has two outputs and we want to `get` both of them, we end up provisioning the inputs twice and running the computation twice.
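To make the cost concrete, here is a minimal sketch of that per-request cycle. The function names (`provision`, `execute`, `transfer_retrieve`) and file names are illustrative stand-ins, not the actual datalad-remake or git-annex API; the point is that each retrieval runs the whole cycle and discards the sibling output:

```python
import shutil
import tempfile
from pathlib import Path

EXECUTIONS = 0  # count how often the computation actually runs


def provision(workdir: Path) -> None:
    # Stand-in for checking out / downloading the declared inputs.
    (workdir / "input.txt").write_text("raw data\n")


def execute(workdir: Path) -> None:
    # Stand-in for the computation: one run produces *both* outputs.
    global EXECUTIONS
    EXECUTIONS += 1
    data = (workdir / "input.txt").read_text()
    (workdir / "out-a.txt").write_text(data.upper())
    (workdir / "out-b.txt").write_text(data[::-1])


def transfer_retrieve(output_name: str, target: Path) -> None:
    # One TRANSFER_RETRIEVE request = one full provision/execute/collect cycle.
    workdir = Path(tempfile.mkdtemp(prefix="remake-"))
    try:
        provision(workdir)
        execute(workdir)
        shutil.copy(workdir / output_name, target)  # collect just one output
    finally:
        shutil.rmtree(workdir)  # the sibling output is thrown away


# Getting both outputs triggers two independent cycles:
dest = Path(tempfile.mkdtemp(prefix="dest-"))
transfer_retrieve("out-a.txt", dest / "out-a.txt")
transfer_retrieve("out-b.txt", dest / "out-b.txt")
print(EXECUTIONS)  # → 2
```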
This is inefficient; however, it is probably a necessary compromise: we cannot hook deep into the logic of `get`. Trading off storage space for compute time is inherent in compute-on-demand. If computations are quick, and provisioning can use local files (i.e. avoid slow re-downloads), this should not cause any noticeable discomfort. Nevertheless, it is a limitation.
Could that be improved upon? Probably not without different tradeoffs.
I re-read the external special remote protocol description and I don't think there is a better place for provisioning and execution than the currently used `TRANSFER_RETRIEVE`.
My one thought was that the outputs could be somehow cached. Instead of dropping the entire secondary worktree and its contents after collecting the requested file (key), we could store the other outputs (if multiple outputs are declared) in some sort of temporary location, either within the dataset or in a temporary directory. Any subsequent `TRANSFER_RETRIEVE` could then check whether the file is available from that cache before starting to provision and execute. One question is how to index the cache. Another (bigger) question is how and when to clear it. Being a special remote, remake is pretty much transparent to the user, so an explicit cache-cleaning command seems the best option, especially if the cache is kept somewhere within the dataset directory. See also: #13
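The caching idea could look roughly like this. Again a hypothetical sketch, not the datalad-remake implementation: the cache is indexed here by output file name for simplicity, whereas a real remote would index by annex key, and cache clearing is left out entirely:

```python
import shutil
import tempfile
from pathlib import Path

EXECUTIONS = 0
CACHE = Path(tempfile.mkdtemp(prefix="remake-cache-"))  # stand-in cache dir


def provision_and_execute(workdir: Path) -> None:
    # Combined stand-in for input provisioning plus the computation,
    # which produces both declared outputs in one run.
    global EXECUTIONS
    EXECUTIONS += 1
    (workdir / "input.txt").write_text("raw data\n")
    data = (workdir / "input.txt").read_text()
    (workdir / "out-a.txt").write_text(data.upper())
    (workdir / "out-b.txt").write_text(data[::-1])


def transfer_retrieve(output_name: str, target: Path) -> None:
    cached = CACHE / output_name
    if cached.exists():          # cache hit: skip provision and execute
        shutil.copy(cached, target)
        return
    workdir = Path(tempfile.mkdtemp(prefix="remake-"))
    try:
        provision_and_execute(workdir)
        # Stash *every* produced output, not just the requested one.
        for out in ("out-a.txt", "out-b.txt"):
            shutil.copy(workdir / out, CACHE / out)
        shutil.copy(workdir / output_name, target)
    finally:
        shutil.rmtree(workdir)


dest = Path(tempfile.mkdtemp(prefix="dest-"))
transfer_retrieve("out-a.txt", dest / "out-a.txt")
transfer_retrieve("out-b.txt", dest / "out-b.txt")  # served from cache
print(EXECUTIONS)  # → 1
```

With this shape, the second `get` costs only a file copy; what remains open is exactly the indexing and invalidation story discussed above.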