Suppose a compute specification declares and produces multiple outputs. How would `get` behave when more than one of these output files is requested?
AFAIK `git annex get` acts sequentially, i.e. retrieves files one by one. This means that the remake special remote would be called upon as many times as there are files to get. Each time it is called, it would go through the steps of provision - execute - collect (each time producing multiple outputs but collecting only one). The specific call is `TRANSFER_RETRIEVE`.
For example: if a specification has two outputs and we want to `get` both of them, we end up provisioning the inputs twice and running the computation twice.
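To make the cost concrete, here is a minimal sketch of that per-request cycle. The function names (`provision`, `execute`, `transfer_retrieve`) and file names are illustrative stand-ins, not the actual datalad-remake or git-annex API; the point is that each retrieval runs the whole cycle and discards the sibling output:

```python
import shutil
import tempfile
from pathlib import Path

EXECUTIONS = 0  # count how often the computation actually runs


def provision(workdir: Path) -> None:
    # Stand-in for checking out / downloading the declared inputs.
    (workdir / "input.txt").write_text("raw data\n")


def execute(workdir: Path) -> None:
    # Stand-in for the computation: one run produces *both* outputs.
    global EXECUTIONS
    EXECUTIONS += 1
    data = (workdir / "input.txt").read_text()
    (workdir / "out-a.txt").write_text(data.upper())
    (workdir / "out-b.txt").write_text(data[::-1])


def transfer_retrieve(output_name: str, target: Path) -> None:
    # One TRANSFER_RETRIEVE request = one full provision/execute/collect cycle.
    workdir = Path(tempfile.mkdtemp(prefix="remake-"))
    try:
        provision(workdir)
        execute(workdir)
        shutil.copy(workdir / output_name, target)  # collect just one output
    finally:
        shutil.rmtree(workdir)  # the sibling output is thrown away


# Getting both outputs triggers two independent cycles:
dest = Path(tempfile.mkdtemp(prefix="dest-"))
transfer_retrieve("out-a.txt", dest / "out-a.txt")
transfer_retrieve("out-b.txt", dest / "out-b.txt")
print(EXECUTIONS)  # → 2
```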
This is inefficient; however, it is probably a necessary compromise: we cannot hook deep into the logic of `get`. Trading off storage space for compute time is inherent in compute-on-demand. If computations are quick, and provisioning can use local files (i.e. avoid slow re-downloads), this should not cause any noticeable discomfort. Nevertheless, it is a limitation.
Could that be improved upon? Probably not without different tradeoffs.
I re-read the external special remote protocol description and I don't think there is a better place for provisioning and execution than the currently used `TRANSFER_RETRIEVE`.
My one thought was that the outputs could be somehow cached. Instead of dropping the entire secondary worktree and its contents after collecting the requested file (key), we could store the other outputs (if multiple outputs are declared) in some sort of temporary location, either within the dataset or in a temporary directory. Any subsequent `TRANSFER_RETRIEVE` could then check whether the file is available from that cache before starting to provision and execute. One question is how to index the cache. Another (bigger) question is how and when to clear it. Being a special remote, remake is pretty much transparent to the user, so an explicit cache-cleaning command seems the best option, especially if the cache is kept somewhere within the dataset directory. See also: #13
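The caching idea could look roughly like this. Again a hypothetical sketch, not the datalad-remake implementation: the cache is indexed here by output file name for simplicity, whereas a real remote would index by annex key, and cache clearing is left out entirely:

```python
import shutil
import tempfile
from pathlib import Path

EXECUTIONS = 0
CACHE = Path(tempfile.mkdtemp(prefix="remake-cache-"))  # stand-in cache dir


def provision_and_execute(workdir: Path) -> None:
    # Combined stand-in for input provisioning plus the computation,
    # which produces both declared outputs in one run.
    global EXECUTIONS
    EXECUTIONS += 1
    (workdir / "input.txt").write_text("raw data\n")
    data = (workdir / "input.txt").read_text()
    (workdir / "out-a.txt").write_text(data.upper())
    (workdir / "out-b.txt").write_text(data[::-1])


def transfer_retrieve(output_name: str, target: Path) -> None:
    cached = CACHE / output_name
    if cached.exists():          # cache hit: skip provision and execute
        shutil.copy(cached, target)
        return
    workdir = Path(tempfile.mkdtemp(prefix="remake-"))
    try:
        provision_and_execute(workdir)
        # Stash *every* produced output, not just the requested one.
        for out in ("out-a.txt", "out-b.txt"):
            shutil.copy(workdir / out, CACHE / out)
        shutil.copy(workdir / output_name, target)
    finally:
        shutil.rmtree(workdir)


dest = Path(tempfile.mkdtemp(prefix="dest-"))
transfer_retrieve("out-a.txt", dest / "out-a.txt")
transfer_retrieve("out-b.txt", dest / "out-b.txt")  # served from cache
print(EXECUTIONS)  # → 1
```

With this shape, the second `get` costs only a file copy; what remains open is exactly the indexing and invalidation story discussed above.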